Monday, July 1, 2013

PHP Cron Controller

I've just finished working on a distributed web application that required fourteen different cron jobs. While the project was being developed and released to beta testers, we could not change when those cron jobs ran, or even which ones ran, because users installed them on their own servers. Users also complained that setting up the cron jobs was the hardest part of the installation. Just before the public release of the project we eliminated one of the cron jobs, but because beta users already had it set up on their systems, the only thing we could do was empty the file so the job could still run without error.

With a traditional cron scheduler, I could not easily control which jobs fired, or when, through a simple application update. Cron setup also seemed difficult for some users to manage, even though I gave them the exact values to enter. Part of the problem was that there were so many jobs, and when a schedule value changed, some users didn't notice the change. This meant that jobs that should have run once a day were running every hour. While this is fixable in the PHP script itself, I'd rather have it done correctly the first time.

The solution I've come up with for this next project is a fire-and-forget cron controller written in PHP. There is one standard cron job that fires every minute. That job is a controller that holds a list of every job along with its schedule. For flexibility, I wanted notation similar to cron's: to run a job every five minutes I'd use */5, and to run it at five past the hour it would simply be 5. For simplicity, I put all the jobs into an array.
$jobs = array(
    'metrics-1' => array(
        'minInt'    => '*/5',
        'hourInt'   => '*',
        'url'       => '/app/cron/metrics.php?block=1'
    ),
    'metrics-2' => array(
        'minInt'    => '*/5',
        'hourInt'   => '*',
        'offset'    => 1,
        'url'       => '/app/cron/metrics.php?block=2'
    )
);
The offset gives a job an easy one-minute delay; in this case metrics-2 runs every five minutes, but one minute after metrics-1. Here's the main portion of the controller.
$batch = array();
foreach($jobs as $name => $data) {
    $min = (int)date('i');
    $hr = (int)date('H');
    $runJob = false;
    $mI = $data['minInt'];
    $mH = $data['hourInt'];
    
    if(isset($data['enable']) && $data['enable'] == 0) {
        continue;
    }
    
    // evaluate minute intervals
    if(strpos($mI, '*') !== false) {
        if(strpos($mI, '/') !== false) {
            $i = str_replace('*/','',$mI);
            if(isset($data['offset'])) {
                $min = $min - (int)$data['offset'];
            }
            if(($min % $i) == 0) {
                $runJob = true;
            }
        } else {
            $runJob = true;
        }
    } elseif(strpos($mI, ',') !== false) {
        $i = explode(',', $mI);
        if(in_array($min, $i)) {
            $runJob = true;
        }
    } else {
        $i = (int)$mI;
        if($min == $i) {
            $runJob = true;
        }
    }
    
    // evaluate hour intervals
    if($runJob === true) {
        if(strpos($mH, '*') !== false) {
            if(strpos($mH, '/') !== false) {
                $i = str_replace('*/','',$mH);
                if(($hr % $i) != 0) {
                    $runJob = false;
                }
            }
        } else {
            $i = (int)$mH;
            if($hr != $i) {
                $runJob = false;
            }
        }
    }
    
    if($runJob === true) {
        echo date('M d, Y H:i:s')."\t[$name]\tStarting\n";
        $batch[] = $data['url'];
    }
}
Since I only needed the minute and hour setting, that's all this controller will handle.
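
The minute and hour checks above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch rather than the controller's actual code: cronFieldMatches() is a hypothetical name, and it assumes the notations used above (*, */n, a single value, or a comma-separated list) plus the optional offset.

```php
<?php
// Hypothetical helper: returns true when $value (the current minute or hour)
// satisfies $spec. Supports '*', '*/n', 'a,b,c' and a single number.
function cronFieldMatches($spec, $value, $offset = 0)
{
    $spec = trim($spec);
    if ($spec === '*') {
        return true;
    }
    if (strpos($spec, '*/') === 0) {
        $step = (int)substr($spec, 2);
        if ($step < 1) {
            return false; // guard against '*/0' or malformed input
        }
        // shift by the offset, as the controller does for minutes
        return (($value - $offset) % $step) === 0;
    }
    // single value or comma-separated list
    $allowed = array_map('intval', explode(',', $spec));
    return in_array((int)$value, $allowed, true);
}

// Worked examples for the two jobs in the array above
var_dump(cronFieldMatches('*/5', 10));    // metrics-1 at minute 10: bool(true)
var_dump(cronFieldMatches('*/5', 11, 1)); // metrics-2 at minute 11: bool(true)
var_dump(cronFieldMatches('*/5', 10, 1)); // metrics-2 at minute 10: bool(false)
```

With a helper like this, a job would run when both cronFieldMatches($data['minInt'], $min, $offset) and cronFieldMatches($data['hourInt'], $hr) return true. Unlike the loop above, this version would also accept a comma-separated list in the hour field.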

Once the controller determines which jobs to run at that time, they are called via cURL. The advantage of cURL is that the jobs can be started and the output from those jobs can be ignored.
$hArr = array();
foreach($batch as $url) {
    $h = curl_init();
    
    // certificate checks are turned off here, so HTTPS calls are not verified
    curl_setopt($h, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($h, CURLOPT_SSL_VERIFYPEER, false);

    curl_setopt($h, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.82 Safari/537.1");
    curl_setopt($h, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($h, CURLOPT_RETURNTRANSFER, false);
    curl_setopt($h, CURLOPT_AUTOREFERER, true);
    curl_setopt($h, CURLOPT_HEADER, false);
    curl_setopt($h, CURLOPT_URL,$url);
    curl_setopt($h, CURLOPT_TIMEOUT, 10);

    $hArr[] = $h;
}
$mh = curl_multi_init();
foreach($hArr as $h) {
    curl_multi_add_handle($mh,$h);
}

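$running = null;
do {
    curl_multi_exec($mh,$running);
    // wait for activity on the handles instead of spinning the CPU
    if($running > 0) {
        curl_multi_select($mh, 1.0);
    }
} while($running > 0);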
Since we don't care about the output, we can set the timeout to ten seconds. I felt ten seconds was a safe margin so that DNS could be resolved and the server could be sure to start the request. Inside the actual jobs, the time limit is set to whatever is needed for that job.
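
One detail the snippets above leave open: the url values in the job array are paths, while cURL needs a full URL. How the controller fills in the host isn't shown here, so the following is only a sketch under that assumption. buildJobUrl() is a hypothetical helper and https://example.com is a placeholder, not a value from the project.

```php
<?php
// Hypothetical helper: joins an assumed base URL with a job path,
// making sure exactly one slash separates the two parts.
function buildJobUrl($baseUrl, $path)
{
    return rtrim($baseUrl, '/') . '/' . ltrim($path, '/');
}

// 'https://example.com/' is a placeholder base URL.
echo buildJobUrl('https://example.com/', '/app/cron/metrics.php?block=1'), "\n";
// prints https://example.com/app/cron/metrics.php?block=1
```

The result would then be passed to CURLOPT_URL in place of the bare path.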

Because the actual jobs are called via standard web services, the usual issues of which user is executing the cron job and what the server document root is are all avoided. Logging for these jobs is harder, but it is also easily handled with a logging class that does error handling (I'll discuss that in a later post).

Update

After deploying this to a couple of dozen servers, I've come to the conclusion that some servers don't deal well with the fire-and-forget method. What appears to be happening is that after the controller stops waiting for a response, the server kills the script that was called. For safety I've increased CURLOPT_TIMEOUT to 300 to ensure that the scripts are not killed by the server. So far this only seems to happen with one hosting company that I know of, so I assume it's a special setup they have.
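
A possible job-side complement, not verified against the host in question: each job script can ask PHP to keep running after the caller disconnects. ignore_user_abort() and set_time_limit() are standard PHP functions; prepareDetachedJob() and the 300-second value below are only illustrative, and a host that kills processes at the server level may ignore both settings.

```php
<?php
// Hypothetical helper to call at the top of a job script such as metrics.php.
function prepareDetachedJob($maxSeconds)
{
    // Keep executing even if the controller's cURL request times out
    // and the connection is closed.
    ignore_user_abort(true);
    // Per-job time limit; whether the host honours it is an assumption.
    set_time_limit($maxSeconds);
    // Called without an argument, ignore_user_abort() reports the current setting.
    return (bool)ignore_user_abort();
}

prepareDetachedJob(300);
```

If this works on a given host, the controller's timeout could stay short, since the job would no longer depend on the connection staying open.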
