Sunday, January 02, 2005

OpenPBS Tips

These technical notes may apply only to the OSCAR 3.0 package.

  • A node crash leaves jobs stuck in the R (running) state
    1. Identify the job number.
    2. sudo rm /var/spool/pbs/server_priv/jobs/[that number].*
    3. sudo killall -9 pbs_server
    4. sudo service pbs_server start
  • Jobs are not released and stay in the E (exiting) state
    The procedure for jobs stuck in the R (running) state applies here as well.
  • pbs_server CPU usage stays at 100%
    sudo service pbs_server restart
  • Also check my previous article about recovering a dead node.
  • Nodes that were manually cleared or brought back online do not run jobs that were submitted before the nodes were reactivated.
      If you can resubmit these jobs, do so. If not, try this:
    1. Identify the job's number in the queue.
    2. sudo qrun [this number]
  • Shutting down too many nodes and leaving them in the "down" state will kill your pbs_server (it will also mark active nodes as "down"). Mark the powered-off nodes as "off-line" instead.
  • Always check the health of pbs_server/maui
  • Adjust the pbs_server log level, or the logs will eat up your disk.
  • When a node responds only to ping, it is experiencing a filesystem problem. Reboot it and inspect it further to determine the cause. If a job was running on it, give up on that job: restart the node and follow the steps for jobs stuck in the R (running) state (the first item above).
  • A similar problem shows up when you log in to the node and it prints something like this:
    switcher/1.0.10(85):ERROR:102: Tcl command execution failed: if { 
    $have_switcher && ! $am_removing } {
      process_switcher_output "announce" [exec switcher --announce]
    
      # Now invoke the switcher perl script to get a list of the modules
      # that need to be loaded.  If we get a non-empty string back, load
      # them.  Only do this if we're loading the module.
    
      process_switcher_output "load" [exec switcher --show-exec]
    }
    This also indicates that the node is experiencing a filesystem problem.
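
The stuck-job procedure from the first item in this list can be wrapped in a small shell function. This is only a sketch under a few assumptions: the names PBS_JOBS_DIR, DRY_RUN, run and clear_stuck_job are made up for illustration and are not part of OpenPBS or OSCAR, and the spool path is the one quoted above. It defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the stuck-job cleanup from the first item in this list.
# PBS_JOBS_DIR, DRY_RUN, run and clear_stuck_job are illustrative names,
# not part of OpenPBS or OSCAR.

PBS_JOBS_DIR="${PBS_JOBS_DIR:-/var/spool/pbs/server_priv/jobs}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

clear_stuck_job() {
    job_number="$1"
    # Accept digits only, so the rm glob below cannot match unrelated files.
    case "$job_number" in
        ''|*[!0-9]*)
            echo "usage: clear_stuck_job <job number>" >&2
            return 1
            ;;
    esac
    # Step 2: remove the job's files from the server spool. The glob is
    # expanded by a root shell because the directory may not be readable
    # by an ordinary user.
    run sudo sh -c "rm $PBS_JOBS_DIR/$job_number.*"
    # Step 3: kill the running server.
    run sudo killall -9 pbs_server
    # Step 4: start it again.
    run sudo service pbs_server start
}

# Example: preview the commands for job 123 (nothing is executed):
#   clear_stuck_job 123
```

Run it once with the default dry run and read the printed commands before setting DRY_RUN=0, since the cleanup deletes server state and kills pbs_server.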
