Tuesday, August 10, 2004

Nodes Recovery

On a Beowulf cluster, compute nodes hang or freeze from time to time, especially when you, the SA (system administrator), are out of town, away for the weekend, or on vacation. So it's very important to have a standard procedure that non-administrators can follow to recover dead nodes correctly.

This technical note may apply only to the OSCAR 3.0 software.

First of all, locate every node in advance and label it with its host name or some other static identifier, such as its MAC address. This will help you find the dead node. Once you have found the dead node, try to shut it down by pressing the power button once. If that doesn't work, hold the power button down until the machine shuts off. If neither works, cut the power with the machine's power switch.

Wait a while to avoid a power surge, then press the power button or switch the power back on. A normal startup with a simple automatic file system check takes about 2 minutes or more, so watch the hard drive LED for disk activity. After the hard drive activity goes quiet, use the ping command from the head node to check whether the node is back.

$ ping node888.local
If you get echo replies from the node, the reboot was successful. If you can't get a reply from the node, stop there and hook up a monitor for further inspection. Non-SA people can stop here.
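Since the node can take a couple of minutes to come back, the ping check can be repeated in a loop instead of by hand. This is only a minimal sketch: the function name wait_for_node, the PING_CMD variable, the 300-second limit, and the 10-second interval are my own assumptions, not part of OSCAR.

```shell
# Hypothetical helper: poll a node with ping until it answers or time runs out.
# PING_CMD is an assumption; "ping -c 1" sends a single echo request on Linux.
PING_CMD="${PING_CMD:-ping -c 1}"

wait_for_node() {
    node="$1"
    limit="${2:-300}"      # total seconds to wait (assumed default)
    interval="${3:-10}"    # seconds between attempts (assumed default)
    waited=0
    while [ "$waited" -lt "$limit" ]; do
        # ping exits with status 0 when it receives a reply
        if $PING_CMD "$node" >/dev/null 2>&1; then
            echo "$node is back"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "$node did not answer within $limit seconds"
    return 1
}
```

From the head node, call it as: wait_for_node node888.local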

To make sure the file systems on the recovered node are clean, reboot it once more after it has come back up:

$ sudo ssh node888.local 'shutdown -Fr now'
This command forces the file systems to be checked again during the reboot. It requires sudo permission or the root password.

Several minutes after the node has rebooted, OpenPBS may hang due to the situation I mentioned before. Your qsub, qdel, qstat, and pbsnodes commands will also hang because pbs_server has stopped responding. If you run top, you will see pbs_server at 100% CPU usage. Kill the hung process with:

$ sudo killall -9 pbs_server
Then start pbs_server again right away:
$ sudo service pbs_server start
These two commands require sudo permission or the root password.
If the message says [Failed], try again. More OpenPBS issues will be addressed in another technical note entry.
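The kill, start, and "try again" steps can be wrapped in one small function. This is only a sketch under assumptions: the function name restart_pbs_server, the KILL_CMD and START_CMD variables, and the default of 3 attempts are mine, and it assumes the service command exits with a non-zero status when it prints [Failed], which I have not verified for every init script.

```shell
# Hypothetical helper: kill a hung pbs_server and start it again, retrying on failure.
# KILL_CMD and START_CMD default to the two commands from this note.
KILL_CMD="${KILL_CMD:-sudo killall -9 pbs_server}"
START_CMD="${START_CMD:-sudo service pbs_server start}"

restart_pbs_server() {
    tries="${1:-3}"        # number of start attempts (assumed default)
    # Ignore the result: killall fails harmlessly if pbs_server is not running.
    $KILL_CMD >/dev/null 2>&1 || true
    n=1
    while [ "$n" -le "$tries" ]; do
        # Assumption: a [Failed] start shows up as a non-zero exit status.
        if $START_CMD; then
            echo "pbs_server started on attempt $n"
            return 0
        fi
        n=$((n + 1))
        sleep 1
    done
    echo "pbs_server failed to start after $tries attempts"
    return 1
}
```

Run it on the head node as: restart_pbs_server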
