These technical notes may apply only to the OSCAR 3.0 package.
- A node crash leaves jobs stuck in the R (running) state
- Identify the job number.
sudo rm /var/spool/pbs/server_priv/jobs/[that number].*
sudo killall -9 pbs_server
sudo service pbs_server start
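The three commands above can be wrapped in a small script. This is only a sketch, assuming the spool path shown above; the names `clear_stuck_job` and `DRY_RUN` are made up for illustration and are not part of PBS. By default it only prints the commands so you can check them first.

```shell
#!/bin/sh
# Sketch: clear a job stuck in the R state after a node crash.
# clear_stuck_job and DRY_RUN are illustrative names, not part of PBS.
JOBS_DIR=/var/spool/pbs/server_priv/jobs

clear_stuck_job() {
    jobnum=$1
    # Accept only a plain number, so the glob below cannot match
    # the files of any other job.
    case $jobnum in
        ''|*[!0-9]*)
            echo "usage: clear_stuck_job <job number>" >&2
            return 1
            ;;
    esac
    # Dry run unless DRY_RUN=0: prefix every command with echo.
    run=echo
    if [ "${DRY_RUN:-1}" = 0 ]; then
        run=
    fi
    $run sudo rm "$JOBS_DIR/$jobnum".*
    $run sudo killall -9 pbs_server
    $run sudo service pbs_server start
}
```

Calling `clear_stuck_job 1234` prints the commands for job 1234; `DRY_RUN=0 clear_stuck_job 1234` actually runs them.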
- Jobs are not released and stay in the E (exiting) state
The same procedure as for jobs stuck in the R state applies.
- pbs_server CPU usage stays at 100%
sudo service pbs_server restart
- Also check my previous article about recovering a dead node.
- Nodes that are manually cleared or brought on-line cannot run jobs that were submitted before the nodes were activated.
- If you can re-submit these jobs, do that; if not, try this:
- Identify the queue number.
sudo qrun [this number]
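If many jobs are waiting, the qrun commands can be generated from the qstat listing. This is a rough sketch, assuming the number meant above is the job id in the first column of the default `qstat` output and that the job state is in the fifth column; `queued_job_ids` and `print_qrun_commands` are made-up helper names. It only prints the commands.

```shell
#!/bin/sh
# Sketch: print a qrun command for every job still in the Q state.
# queued_job_ids and print_qrun_commands are illustrative helpers,
# not PBS commands.
queued_job_ids() {
    # Reads default qstat output on stdin, skips the two header lines,
    # and prints the job id (column 1) where the state (column 5) is Q.
    awk 'NR > 2 && $5 == "Q" { print $1 }'
}

print_qrun_commands() {
    queued_job_ids | while read -r id; do
        echo "sudo qrun $id"
    done
}

# Usage: qstat | print_qrun_commands
```

Review the printed list before running anything, since it includes every queued job, not only the ones affected by the re-activated nodes.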
- Shutting down too many nodes and leaving them in the "down" state will kill your pbs_server (it will also mark active nodes as "down"). You need to mark these powered-off nodes as "off-line".
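Marking nodes off-line is done with `pbsnodes -o`. The sketch below assumes that `pbsnodes -l` prints one "node state" pair per line; `offline_commands` is a made-up helper name. It only prints the commands, one for each node that is down and not yet off-line.

```shell
#!/bin/sh
# Sketch: print a "pbsnodes -o" command for each node reported as down
# and not yet marked offline.
# offline_commands is an illustrative helper, not a PBS command.
offline_commands() {
    # Reads `pbsnodes -l` output on stdin, assumed to be "<node> <state>".
    awk '$2 ~ /down/ && $2 !~ /offline/ { print "sudo pbsnodes -o " $1 }'
}

# Usage: pbsnodes -l | offline_commands
```

Check the list by hand before running it: a node that is down because it crashed, not because it was powered off, will show up here too.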
- Always check the health of pbs_server/maui
- Adjust the pbs_server log level, or it will eat up your disk.
- When a node responds only to ping, it is experiencing a filesystem problem; you need to reboot it and inspect it to determine the cause. If a job was running on it, give that job up, restart the node, and follow the steps above for jobs stuck in the R (running) state.
- A similar problem is when you log in to the node and it prints something like this:
switcher/1.0.10(85):ERROR:102: Tcl command execution failed: if { $have_switcher && ! $am_removing } { process_switcher_output "announce" [exec switcher --announce] # Now invoke the switcher perl script to get a list of the modules # that need to be loaded. If we get a non-empty string back, load # them. Only do this if we're loading the module. process_switcher_output "load" [exec switcher --show-exec] }
This also indicates that the node is experiencing a filesystem problem.