This script may not clean up every orphan process you want to remove. I recommend giving LAM/MPI a try instead of MPICH.
To clean up the processes left behind on the nodes after a job exits, an epilogue script is a convenient choice. Here is an example for Torque in the OSCAR 4.x package (it does not suit every scenario):
#!/bin/sh
# Note that ALL processes owned by $USER will be killed (!!!)
echo '--------------------------------------'
echo Running PBS epilogue script
# Set key variables
USER=$2
NODEFILE=/var/spool/pbs/aux/$1
# Count the distinct nodes assigned to the job
NNODES=`/bin/sort $NODEFILE | /usr/bin/uniq | /usr/bin/wc -l`
if [ "$NNODES" -eq 1 ]; then
   # only one node used
   echo Done.
   #su $USER -c "skill -v -9 -u $USER"
else
   # more than one node used
   echo Killing processes of user $USER on the batch nodes
   # The node file has one line per allocated processor, so visit each node only once
   for node in `/bin/sort -u $NODEFILE`
        do
        echo Doing node $node
        su $USER -c "ssh -a -k -n -x $node skill -v -9 -u $USER"
   done
   echo Done.
fi
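
Because the epilogue kills every process the user owns on each node, it may help to see which commands it would issue before enabling it. The following is only a minimal dry-run sketch under a few assumptions: the helper names unique_nodes and print_cleanup_commands, the user name alice, and the node names are all made up for illustration. It prints the ssh commands rather than running them.

```shell
#!/bin/sh
# Dry-run sketch: print the cleanup commands instead of executing them.
# unique_nodes and print_cleanup_commands are hypothetical helper names.

# Print each node name from the node file once, in sorted order.
unique_nodes() {
    sort -u "$1"
}

# Print the command the epilogue would run on each node, without running it.
print_cleanup_commands() {
    user=$1
    nodefile=$2
    for node in $(unique_nodes "$nodefile")
    do
        echo "ssh -a -k -n -x $node skill -v -9 -u $user"
    done
}

# Example with a made-up node file in which node1 has two processors allocated.
demo=$(mktemp)
printf 'node1\nnode1\nnode2\n' > "$demo"
print_cleanup_commands alice "$demo"
rm -f "$demo"
```

In this sketch each node is listed once even though node1 appears twice in the node file, so the kill command would be sent to it a single time.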
 
1 comment:
Actually, this epilogue is considered harmful; please use it only as an example.