Parallel prologue and epilogue in Grid Engine

Posted by ajdecon on Server Fault
Published on 2011-03-02T18:43:48Z

We have a cluster being used to run MPI jobs for a customer. Previously this cluster used Torque as the scheduler, but we are transitioning to Grid Engine 6.2u5 (for some other features). Unfortunately, we are having trouble duplicating some of our maintenance scripts in the Grid Engine environment.

In Torque, we have a prologue.parallel script which is used to carry out an automated health-check on the node. If this script returns a fail condition, Torque will helpfully offline the node and re-queue the job to use a different group of nodes.

In Grid Engine, however, the queue "prolog" runs only on the head node of the job. We can manually run our prologue script on the other nodes from the startmpi.sh initialization script of the mpi parallel environment, but I can't figure out how to detect a fail condition and carry out the same "mark offline and requeue" procedure.
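To make the goal concrete, here is a rough sketch of the kind of wrapper we would like to call from startmpi.sh. Several things in it are assumptions rather than confirmed behavior: `HEALTH_CHECK`, `REMOTE_SHELL`, `DISABLE_CMD` and `FAIL_STATUS` are hypothetical settings, the third column of the PE hostfile is assumed to name the queue instance, `qmod -d` is assumed to be usable from the node running the start procedure (it normally needs operator rights on an admin host), and the exit status that makes Grid Engine requeue the job from a PE start procedure should be checked against sge_pe(5) and sge_conf(5) for the installed version.

```shell
#!/bin/sh
# Sketch only: per-node health check for a Grid Engine PE start procedure.
# All four settings below are hypothetical, site-specific values.
HEALTH_CHECK="${HEALTH_CHECK:-/usr/local/sbin/node_health_check}"
REMOTE_SHELL="${REMOTE_SHELL:-ssh}"
DISABLE_CMD="${DISABLE_CMD:-qmod -d}"
# Assumption: this status makes Grid Engine requeue the job; verify locally.
FAIL_STATUS="${FAIL_STATUS:-99}"

# Read a PE hostfile (assumed columns: host, slots, queue instance, ...),
# run the health check on every host, and disable the queue instance of
# each host that fails. Returns 0 if all hosts pass, FAIL_STATUS otherwise.
check_pe_hosts() {
    hostfile="$1"
    status=0
    while read -r host slots qinstance _rest; do
        [ -n "$host" ] || continue
        # </dev/null keeps the remote shell from swallowing the hostfile.
        if ! $REMOTE_SHELL "$host" "$HEALTH_CHECK" </dev/null; then
            echo "health check failed on $host, disabling $qinstance" >&2
            $DISABLE_CMD "$qinstance" </dev/null
            status="$FAIL_STATUS"
        fi
    done < "$hostfile"
    return "$status"
}

# Run only when Grid Engine has provided a hostfile for the job.
if [ -n "${PE_HOSTFILE:-}" ]; then
    check_pe_hosts "$PE_HOSTFILE" || exit "$FAIL_STATUS"
fi
```

Disabling the queue instance is meant to stand in for Torque's "offline the node" step. Whether a non-zero exit here requeues the job without also putting the master queue into an error state is exactly the part I have not been able to confirm.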

Any suggestions?
