Writing cron jobs in PHP

Introduction

Where I work we have a web API that, while simple to use, cannot be used by every client that wants to integrate with our system. For these clients we sometimes offer an FTP solution that lets them upload files, which we then process periodically using scripts. These scripts are written in PHP. This article assumes you are using a *nix variant that can run PHP, such as Linux or Mac OS X.

Why PHP?

PHP is a natural choice for this job, much to the dismay of our system administrator and resident bash guru. These scripts talk to our web API, and manipulate CSV or XML files. PHP has the functionality to handle this readily available, and the programmers who maintain these are generally 'webbies'.

The script

Writing a PHP script that is run over and over, without a web server or even a browser with someone watching it, is a little different. You want the script to do its work quietly and say nothing unless something is wrong. If the period is short enough, you don't even want it logging the fact that it ran. We have one script that runs every 30 seconds!

Here I will go through making a short script that runs very often and works on larger files. This will demonstrate a few problems these scripts face, namely: loading and processing the files, handling files that are still being uploaded, keeping track of which files have been done, and making sure only one instance runs at a time.

Loading XML and processing it

So let's start with a simple script that reads any XML files it finds in a directory and prints out a particular XML element. Loading a CSV file is not too different, if a little messier, since you have to match column indices to fields.

#! /usr/bin/env php
<?php

libxml_use_internal_errors( true );

$in_dir = 'incoming/';

$files = glob( $in_dir . '*.xml' );

foreach( $files as $file ) {

    $xml = simplexml_load_file( $file );

    if (!$xml) {
        $errors = libxml_get_errors();

        foreach ($errors as $error) {
            print_r( $error );
        }

        libxml_clear_errors();

    } else {
        if( isset( $xml->name ) ) {
            echo "Name: " . $xml->name . "\n";
        }
    }
}

The first line is known as a shebang or hashbang, among other names.

Next we tell libxml not to print XML errors immediately, so we can control how they are reported. For now the reporting is very simple. Other than that, there is nothing here that most PHP programmers have not seen before.
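If you want tidier output than print_r, each libxml error object carries the message and position, so you can report one line per error. Here is a self-contained sketch (the broken XML string is just for demonstration):

```php
<?php
// Sketch: report each libxml error as a single line instead of print_r.
libxml_use_internal_errors( true );

$xml = simplexml_load_string( '<doc><broken></doc>' );  // deliberately invalid

$messages = array();
if( !$xml ) {
    foreach( libxml_get_errors() as $error ) {
        // Each error knows its line number and carries a human-readable message.
        $messages[] = sprintf( "line %d: %s", $error->line, trim( $error->message ) );
    }
    libxml_clear_errors();
}

echo implode( "\n", $messages ), "\n";
```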

We can test this easily by running it from the command line, but we can also run it repeatedly using the 'watch' command. This command is usually used to run another program over and over and watch its output.

watch ./load_and_print.php

By default this will run the script every 2 seconds. Note that macOS does not ship with watch; it can be installed via a package manager such as Homebrew or MacPorts.

This way you can leave it running for a while and use another terminal to try dropping a file into the 'incoming' directory. If you do, you will notice that each file is printed on every run, not just once: we're "processing" each file, but redoing it every time. We need to keep track of which files have been done.

The easiest way I found is to simply move the file to a different directory.

We can add the following to the script:

$out_dir = 'processed/';	// Just after the $in_dir line

$moved = $out_dir . basename( $file ); // At the beginning of the foreach loop
rename( $file, $moved );

$xml = simplexml_load_file( $moved ); // change $file to $moved

Renaming the file into another directory gives you the tiniest hint of a potential gotcha: what if the file you are working on is not completely there yet? This is very possible when the file is arriving over the Internet.

You might initially expect the rename function to fail if the file has not finished uploading and is therefore still open by the FTP server. Perhaps you could test for this and skip any file that is not ready.

However, the short answer is no: the rename works fine even while the file is still being written to. The file name is only used to open the file - once it is open and you have a file handle, the name is no longer needed. Having a file open does not stop another process (our script) from renaming it. Renaming is also how files are 'moved', as long as they stay on the same disk.
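This behaviour is easy to demonstrate within a single script on a *nix system: open a file, rename it, and keep writing through the original handle. A self-contained sketch, using the system temp directory so it can run anywhere:

```php
<?php
// Sketch: a rename succeeds even while the file is open for writing.
// The open handle keeps pointing at the same inode, so writes through
// it land in the file at its new location. (*nix behaviour.)

$dir  = sys_get_temp_dir();
$orig = $dir . '/upload_in_progress.xml';
$dest = $dir . '/moved.xml';

$fp = fopen( $orig, 'w' );          // simulate the FTP server's open handle
fwrite( $fp, "<doc><name>first" );

rename( $orig, $dest );             // "moving" the file while it is open

fwrite( $fp, " half</name></doc>" );  // the write still goes through
fclose( $fp );

echo file_get_contents( $dest ), "\n";
```

Both halves of the write end up in the moved file, showing the rename did not disturb the writer at all.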

Checking a file is ready to process

So how do we check whether a file is open by another process? There is a *nix command called lsof ("LiSt Open Files") that lists open files; given a file name, it lists that file only if it is open.

We can add the following function:

function isFileOpen( $filename ) {
  $ret = exec( '/usr/sbin/lsof ' . escapeshellarg( $filename ) );
  if( $ret == '' ) {
    return false;
  }
  return true;
}

The directory where lsof lives may need to be changed for your particular system. Mac and Red Hat based systems (e.g. Fedora, CentOS) have it in /usr/sbin/, while Ubuntu, at least, has it in /usr/bin/. We use this function to check whether the file is open, and if it is, we skip the file by continuing the loop. The file may be ready by the time the next run happens.
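Rather than hard-coding the path, a variant of the function could locate lsof at runtime. This is only a sketch, assuming a POSIX shell is available ('command -v' is a POSIX shell builtin); the fallback list covers the locations mentioned above:

```php
<?php
// Sketch: locate lsof at runtime instead of hard-coding its path.
function findLsof() {
    // Ask the shell first; shell_exec returns the path plus a newline, or
    // nothing if lsof is not on the PATH.
    $path = trim( (string) shell_exec( 'command -v lsof' ) );
    if( $path !== '' ) {
        return $path;
    }
    // Fall back to the usual locations.
    foreach( array( '/usr/sbin/lsof', '/usr/bin/lsof' ) as $candidate ) {
        if( is_executable( $candidate ) ) {
            return $candidate;
        }
    }
    return false;   // not installed; the caller decides what to do
}

function isFileOpen( $filename ) {
    $lsof = findLsof();
    if( $lsof === false ) {
        return false;   // cannot check, assume the file is ready
    }
    // Silence lsof's stderr so the cron job stays quiet.
    $ret = exec( $lsof . ' ' . escapeshellarg( $filename ) . ' 2>/dev/null' );
    return $ret != '';
}
```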

Making sure only one instance is running at a time

So what happens when the script starts but has too much to do before it is scheduled to run again? We don't want two instances working on the same set of files. One option is to have the script move all the files into a working directory at the beginning and process them from there; that way the next run either finds no files to process, or a completely new set. Another method is to use a lock file, which is my preferred way: it is only a few lines of code, guarantees only one instance of the script is processing at a time, and won't load up your server.
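For completeness, the working-directory alternative just mentioned might look something like this. It is only a sketch: the directories here use the system temp dir so it is self-contained, where the real script would use 'incoming/' and a working directory of its own:

```php
<?php
// Sketch of the move-first alternative: claim every pending file into a
// per-run working directory before processing, so the next run never
// touches the same set of files.
$in_dir   = sys_get_temp_dir() . '/incoming/';
$work_dir = sys_get_temp_dir() . '/working-' . getmypid() . '/';

if( !is_dir( $in_dir ) )   mkdir( $in_dir, 0755, true );
if( !is_dir( $work_dir ) ) mkdir( $work_dir, 0755, true );

file_put_contents( $in_dir . 'example.xml', '<doc/>' );   // a pending file

$claimed = array();
foreach( glob( $in_dir . '*.xml' ) as $file ) {
    $moved = $work_dir . basename( $file );
    if( rename( $file, $moved ) ) {   // a racing instance may claim it first
        $claimed[] = $moved;
    }
}

foreach( $claimed as $file ) {
    // ...process $file as in the main script...
    echo "Claimed: " . basename( $file ) . "\n";
}
```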

Here's the lock-file code, to add to the beginning:

$lock_file = 'load_and_print.lock';

$fp_lock = fopen( $lock_file, 'w' );
if( !flock( $fp_lock, LOCK_EX | LOCK_NB ) ) {
    echo "Could not lock the lock file. An instance is already running.\n";
    die();
}

The lock is released automatically when the script exits. This way, if one instance is still busy processing, subsequent invocations simply exit instead of piling up and possibly overloading the server.

Here's the full script again:

#! /usr/bin/env php
<?php

$in_dir = 'incoming/';
$out_dir = 'processed/';
$lock_file = 'load_and_print.lock';

$fp_lock = fopen( $lock_file, 'w' );
if( !flock( $fp_lock, LOCK_EX | LOCK_NB ) ) {
    echo "Could not lock the lock file. An instance is already running.\n";
    die();
}

libxml_use_internal_errors( true );

function isFileOpen( $filename ) {
    $ret = exec( '/usr/sbin/lsof ' . escapeshellarg( $filename ) );
    if( $ret == '' ) {
        return false;
    }
    return true;
}

$files = glob( $in_dir . '*.xml' );

foreach( $files as $file ) {

    if( isFileOpen( $file ) ) {
        continue;   // We'll possibly get this next time
    }

    $moved = $out_dir . basename( $file );
    rename( $file, $moved );

    $xml = simplexml_load_file( $moved );

    if (!$xml) {
        $errors = libxml_get_errors();

        foreach ($errors as $error) {
            print_r( $error );
        }

        libxml_clear_errors();

    } else {
        // Process the file...
        if( isset( $xml->name ) ) {
            echo "Name: " . $xml->name . "\n";
        }
    }
}

Installing the script as a cron job

To run the script periodically, the easiest method is probably to run it as a cron job. To do this, edit your crontab using the command crontab -e, which opens your list of cron jobs in your default editor. There you can add a line similar to the following:


*/1 * * * * cd /where/the/scripts/are; ./my_script.php

This changes to the directory and then runs the script - every minute. Changing directory first allows you to use shorter, relative file names within the script itself. See the documentation on cron for more information on how schedules are specified; all sorts of schedules are possible. Should you want the script to run more often than once a minute, one technique is the following:


*/1 * * * * cd /where/the/scripts/are; ./my_script.php
*/1 * * * * sleep 30; cd /where/the/scripts/are; ./my_script.php
Together, these two lines run the script every 30 seconds.
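For reference, a few other illustrative schedules (the paths are placeholders, as above):

```
*/5 * * * *  cd /where/the/scripts/are; ./my_script.php    # every 5 minutes
0 * * * *    cd /where/the/scripts/are; ./my_script.php    # on the hour
30 2 * * 1   cd /where/the/scripts/are; ./my_script.php    # 02:30 every Monday
```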

The End

That's it for now. I didn't cover how we send the data to our API, which simply receives POST variables, one of them being encoded XML. Basically libcurl is used, via PHP's cURL extension, but that can be the subject of another (albeit shorter) article.
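As a rough taste of what that looks like: the URL and field names below are invented placeholders, not our real API, but the general shape of POSTing XML with PHP's cURL extension is something like this:

```php
<?php
// Sketch only: 'action', 'data' and the endpoint URL are hypothetical
// placeholders. Shows the general shape of POSTing encoded XML via cURL.

function buildPostFields( $xml_string ) {
    return http_build_query( array(
        'action' => 'import',                       // hypothetical field
        'data'   => base64_encode( $xml_string ),   // encoded XML payload
    ) );
}

function postXml( $url, $xml_string ) {
    $ch = curl_init( $url );
    curl_setopt( $ch, CURLOPT_POST, true );
    curl_setopt( $ch, CURLOPT_POSTFIELDS, buildPostFields( $xml_string ) );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );   // return the body
    $response = curl_exec( $ch );
    $error    = curl_errno( $ch ) ? curl_error( $ch ) : null;
    curl_close( $ch );
    return array( $response, $error );
}

// Usage (hypothetical endpoint):
// list( $body, $err ) = postXml( 'https://api.example.com/import', $xml_text );
```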