Notes: molecular simulation (was HPC in Sciences): OpenPBS Tips

These technical notes may apply only to the OSCAR 3.0 package.

Node crashes leave jobs stuck in the R (running) state
1. Identify the job number.
2. sudo rm /var/spool/pbs/server_priv/jobs/[that number].*
3. sudo killall -9 pbs_server
4. sudo service pbs_server start
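
The four steps above can be wrapped in a small helper. This is only a sketch, assuming the spool path and service name shown in the steps; the names `clear_stuck_job`, `run`, `PBS_JOBS_DIR` and `DRY_RUN` are made up for illustration. By default it prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: clear a job stuck in the R or E state after a node crash.
# Path and service name come from the steps above; adjust them for your site.
PBS_JOBS_DIR="${PBS_JOBS_DIR:-/var/spool/pbs/server_priv/jobs}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

clear_stuck_job() {
    jobnum="$1"
    # Accept only a plain number so the file glob below stays narrow.
    case "$jobnum" in
        ''|*[!0-9]*) echo "usage: clear_stuck_job <job number>" >&2; return 1 ;;
    esac
    # The glob is expanded by a root shell, in case server_priv is not
    # readable by the calling user (an assumption about directory permissions).
    run sudo sh -c "rm -f $PBS_JOBS_DIR/$jobnum.*"
    run sudo killall -9 pbs_server
    run sudo service pbs_server start
}

# Usage: review the printed commands, then rerun with DRY_RUN=0.
#   clear_stuck_job 1234
```

Validating the job number before building the `rm` pattern is deliberate: an empty argument would otherwise match every job file in the directory.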
Jobs not released and stuck in the E (exiting) state
The same procedure as above applies.
pbs_server CPU usage stays at 100%
sudo service pbs_server restart
Also check my previous article about recovering a dead node.
Nodes that were manually cleared or brought back online do not run jobs submitted before they were reactivated
1. Identify the job number of the queued job.
2. sudo qrun [this number]
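
When several jobs are waiting, the job numbers can be pulled from `qstat` instead of being read off by hand. The sketch below assumes the default `qstat` column layout (Job id, Name, User, Time Use, S, Queue) and job names without spaces; `queued_job_ids` is a hypothetical name.

```shell
# Hypothetical helper: print the IDs of jobs in state Q from default `qstat`
# output read on stdin. Assumes the state letter is the fifth
# whitespace-separated field and that job IDs start with a digit.
queued_job_ids() {
    awk '$1 ~ /^[0-9]/ && $5 == "Q" { print $1 }'
}

# Usage (prints the commands; remove the echo to actually force the jobs to run):
#   qstat | queued_job_ids | while read -r id; do echo sudo qrun "$id"; done
```

Matching on a leading digit skips the two header lines without depending on their exact position.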
Shutting down too many nodes and leaving them in the "down" state will kill your pbs_server (it will also mark active nodes as "down"). Mark the powered-off nodes as "offline" instead.
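
Marking nodes offline is done with `pbsnodes -o`, and `pbsnodes -c` clears the flag again when a node comes back. A minimal sketch, where `mark_offline` and the `PBSNODES` override variable are hypothetical names and `node12`, `node13` are placeholder hostnames:

```shell
# Hypothetical helper: mark each given node offline before powering it off.
# PBSNODES can be set to another command (for example "echo") for a dry run.
mark_offline() {
    for node in "$@"; do
        ${PBSNODES:-pbsnodes} -o "$node"
    done
}

# Usage:
#   mark_offline node12 node13      # before shutting the nodes down
#   pbsnodes -c node12              # after node12 is back up
#   pbsnodes -l                     # list nodes currently down or offline
```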
Always check the health of pbs_server and maui.
Adjust the pbs_server log level, or the logs will eat up your disk.
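
One way to lower the log volume is the server's `log_events` attribute, a bitmask set through `qmgr`. The sketch below assumes the commonly documented bit values (1 errors, 2 system, 4 admin, 8 job events, with the debug classes in the higher bits); check them against the administrator guide for your PBS version. The helper name `log_events_cmd` is hypothetical, and it only prints the command.

```shell
# Assumed event-class bits for the log_events mask; verify for your version.
LOG_ERROR=1
LOG_SYSTEM=2
LOG_ADMIN=4
LOG_JOB=8

# Hypothetical helper: OR the given bits together and print the qmgr command.
log_events_cmd() {
    mask=0
    for bit in "$@"; do
        mask=$((mask | bit))
    done
    printf 'qmgr -c "set server log_events = %d"\n' "$mask"
}

# Usage: keep errors, system, admin and job events, drop the debug classes.
#   log_events_cmd "$LOG_ERROR" "$LOG_SYSTEM" "$LOG_ADMIN" "$LOG_JOB"
```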
When a node responds only to ping, it is experiencing a filesystem problem; reboot and inspect it to determine the cause. If a job was running on it, write the job off, reboot the node, and follow the steps above for jobs stuck in the R (running) state.

A similar problem is when you log in to the node and get an error like this:

switcher/1.0.10(85):ERROR:102: Tcl command execution failed: if { 
$have_switcher && ! $am_removing } {
  process_switcher_output "announce" [exec switcher --announce]

  # Now invoke the switcher perl script to get a list of the modules
  # that need to be loaded.  If we get a non-empty string back, load
  # them.  Only do this if we're loading the module.

  process_switcher_output "load" [exec switcher --show-exec]
}

This also indicates that the node is experiencing a filesystem problem.
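
The ping-only symptom can be checked for from the head node. This is a rough sketch, assuming Linux `ping` and key-based SSH access to the compute nodes; `classify_node` and `probe_node` are hypothetical names. It does not catch the switcher error, since that login still succeeds, and a failed SSH probe can have causes other than the filesystem.

```shell
# Hypothetical helper: classify a node from two probe exit codes (0 = success).
classify_node() {
    ping_rc="$1"
    ssh_rc="$2"
    if [ "$ping_rc" -ne 0 ]; then
        echo "down"
    elif [ "$ssh_rc" -ne 0 ]; then
        echo "ping-only"    # answers on the network, but a login fails
    else
        echo "ok"
    fi
}

# Probe one node: a single ping with a 2 second timeout, then a no-op SSH command.
probe_node() {
    node="$1"
    ping -c 1 -W 2 "$node" >/dev/null 2>&1
    p=$?
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true >/dev/null 2>&1
    s=$?
    echo "$node $(classify_node "$p" "$s")"
}

# Usage:
#   for n in node01 node02; do probe_node "$n"; done
```

Keeping the classification separate from the probing makes the decision logic easy to check without touching the network.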

Sunday, January 02, 2005
