Friday, December 17, 2010
amber11 pmemd + LAM-7.1.4/torque
Wednesday, October 14, 2009
E1618 on PowerEdge 2950
| E1618 | PS # Predictive | Power supply voltage is out of acceptable range; specified power supply is improperly installed or faulty. |
Monday, June 15, 2009
Fedora 8 x86_64 server, OSCAR 5.x, ia32 nodes oh my!
Please leave a comment below if anyone is reading this.
First of all, my server is a Dell PE2950 with 8 GB of memory and 6 network ports. The first time I tried to install OSCAR, I was using CentOS 5.3 x86_64 with OSCAR 6.0.3-1, but OSCAR 6.0.3-1 was not stable enough for me to install things without errors. No luck with Fedora 9 and OSCAR 6.0.3-1 either, thanks to many perl package obstacles. When I first fell back to OSCAR 5.1+, I tried Fedora 9 under a false impression of compatibility, but unfortunately it is not compatible. So here I come, Fedora 8 x86_64! My goal is to install a cluster with ia32 (i386) nodes on an x86_64 server, utilizing all 6 network ports. Here are some notes:
- Since I installed Fedora 8 from disk+network without any modification (the only option I chose was the developer software group), I only got a 2 GB swap partition, so I need to use a swapfile.
# dd if=/dev/zero of=/swapfile0 bs=1M count=8192
# mkswap /swapfile0; chmod 600 /swapfile0; swapon /swapfile0
# echo "/swapfile0 swap swap defaults 0 0" >> /etc/fstab
# chkconfig NetworkManager off (in fc8 after the text-mode installation, the default setting is off)
# chkconfig network on (in fc8 after the text-mode installation, the default setting is on)
# chkconfig iptables off
# chkconfig ip6tables off
- Make sure you really turned off iptables and NetworkManager; the GUI tools might be deceptive.
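The echo into /etc/fstab above appends unconditionally, so re-running these notes duplicates the entry. A minimal sketch of a guard, with a made-up helper name (add_swap_entry is not part of the original notes):

```shell
# add_swap_entry FSTAB SWAPFILE -- append a swap line only if absent
add_swap_entry() {
    fstab="$1"; swapfile="$2"
    # anchor on the device field so /swapfile01 does not match /swapfile0
    if ! grep -q "^$swapfile[[:space:]]" "$fstab"; then
        echo "$swapfile swap swap defaults 0 0" >> "$fstab"
    fi
}
```

Call it as `add_swap_entry /etc/fstab /swapfile0`; running it twice leaves a single line.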
# perl -pi -e 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# yum -y install bridge-utils gnuplot grace
# brctl addbr br0
# brctl addif br0 eth1; brctl addif br0 eth2; brctl addif br0 eth3; brctl addif br0 eth4; brctl addif br0 eth5
- edit /etc/sysconfig/network-scripts/ifcfg-eth[1-5] accordingly with BOOTPROTO=none, ONBOOT=yes, BRIDGE=br0, NM_CONTROLLED=no, IPV6INIT=no, IPV6_AUTOCONF=no
- edit /etc/sysconfig/network-scripts/ifcfg-eth0 to set NM_CONTROLLED=no, IPV6INIT=no, IPV6_AUTOCONF=no
- create /etc/sysconfig/network-scripts/ifcfg-br0 with contents of
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
IPADDR=10.0.0.254
NETMASK=255.255.255.0
ONBOOT=yes
NOZEROCONF=yes
DELAY=0
STP=no
NM_CONTROLLED=no
IPV6INIT=no
IPV6_AUTOCONF=no
- create /etc/modprobe.d/disableipv6 with one line:
install ipv6 /bin/true
- edit /etc/sysconfig/network to modify the hostname, and use the "hostname" command to change the current setting, too.
- Add oscar_server nfs_oscar pbs_oscar for 10.0.0.254 into the file /etc/hosts
- Add all the names for this server into the file /etc/mail/local-host-names
- Add
ALL : 10.0.0.0/255.255.255.0,localhost,the-external-ip-for-your-server
sshd : 10.0.0.0/255.255.255.0,.uci.edu
into the file /etc/hosts.allow. (I am from .uci.edu.)
- Add
ALL : ALL EXCEPT LOCAL
into the file /etc/hosts.deny.
- If httpd was installed, add
ServerAdmin yourname@yourmail.box
ServerSignature Off
ServerTokens Prod
into a new file /etc/httpd/conf.d/lab.conf
# yum remove NetworkManager.i386 (important if you want to do a full update with yum.)
# yum update (optional, but it should be very helpful.)
# reboot
- I have doubts about OSCAR 6, so I chose to use OSCAR 5.x from the nightly branch; I downloaded oscar-repo-common-rpms-5*nightly-*.tar.gz, oscar-repo-fc-8-x86_64-5*nightly-*.tar.gz and oscar-repo-fc-8-i386-5*nightly-*.tar.gz
# mkdir -p /tftpboot/distro /tftpboot/oscar
# tar xzfC oscar-repo-common-rpms-*.tar.gz /tftpboot/oscar/
# tar xzfC oscar-repo-fc-8-x86_64-5*nightly-*.tar.gz /tftpboot/oscar/
# tar xzfC oscar-repo-fc-8-i386-5*nightly-*.tar.gz /tftpboot/oscar/
# perl -pi -e 's/gpgcheck=1/gpgcheck=0/' /etc/yum.conf
# yum install createrepo /tftpboot/oscar/common-rpms/yume*.rpm
# yume --repo /tftpboot/oscar/common-rpms install oscar-base
# rsync -avx --delete --bwlimit=128 rsync://archive.fedoraproject.org/fedora-archive/fedora/linux/releases/8/Fedora/x86_64/os/Packages/ /tftpboot/distro/fedora-8-x86_64/
# rsync -avx --delete --bwlimit=128 rsync://archive.fedoraproject.org/fedora-archive/fedora/linux/releases/8/Fedora/i386/os/Packages/ /tftpboot/distro/fedora-8-i386/
- create /tftpboot/distro/fedora-8-i386.url with one line:
file:/tftpboot/distro/fedora-8-i386
since this file won't be generated automatically. (not sure if fedora-8-x86_64.url was generated or bundled with the packages; check it anyway.)
# cd /opt/oscar/lib
# ../scripts/repo-update --url http://archive.fedoraproject.org/pub/archive/fedora/linux/updates/8/i386.newkey --repo /tftpboot/distro/fedora-8-i386
# ../scripts/repo-update --rmdup --repo /tftpboot/distro/fedora-8-i386
# yume --prepare --repo /tftpboot/distro/fedora-8-i386
# ../scripts/repo-update --url http://archive.fedoraproject.org/pub/archive/fedora/linux/updates/8/x86_64.newkey --repo /tftpboot/distro/fedora-8-x86_64
# ../scripts/repo-update --rmdup --repo /tftpboot/distro/fedora-8-x86_64
# rm /tftpboot/distro/*/*torque* /tftpboot/distro/*/openmpi*rpm
# yume --prepare --repo /tftpboot/distro/fedora-8-x86_64
- Make sure that /tftpboot/distro/fedora-8-x86_64.url, /tftpboot/distro/fedora-8-i386.url, /tftpboot/oscar/fc-8-i386.url and /tftpboot/oscar/fc-8-x86_64.url exist and contain correct URL information.
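That last sanity check can be scripted; a quick sketch (check_url_files is a made-up name) that reports which .url files are missing or empty:

```shell
# check_url_files FILE... -- print OK plus the URL, or MISSING, per file
check_url_files() {
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "OK: $f -> $(head -1 "$f")"
        else
            echo "MISSING: $f"
        fi
    done
}

check_url_files /tftpboot/distro/fedora-8-x86_64.url /tftpboot/distro/fedora-8-i386.url \
                /tftpboot/oscar/fc-8-i386.url /tftpboot/oscar/fc-8-x86_64.url
```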
# perl -pi -e 's/^#PermitRootLogin yes/PermitRootLogin yes/' /etc/ssh/sshd_config
# yum install perl-AppConfig
# perl -pi -e 's/\/usr\/sbin\/netbootmgr/\/usr\/bin\/netbootmgr/' /opt/oscar/scripts/oscar_wizard
# cd /opt/oscar; env OSCAR_VERBOSE=3 ./install_cluster br0
- uncheck loghost from the package list because it is not working in fc8.
- normally you don't need to worry about the following configuration step. I didn't change anything, but ganglia seems to be a good place to modify the default settings.
- install server packages
- Do remember that every time you run "./install_cluster", your previous settings made via install_cluster will be lost, and you will have to redo the settings and package installation. To avoid that, once the "install server packages" step is done, if for some reason you exited ./install_cluster, you are free to use oscar_wizard instead: 1. open a new shell window, not your old ./install_cluster shell; 2. cd /opt/oscar/scripts; ./oscar_wizard
- revise /opt/oscar/oscarsamples/scsi.disk for SATA/SCSI nodes configuration.
- revise /opt/oscar/oscarsamples/fc-8-i386.rpmlist
- build an image for i386 (ia32) nodes
- Use another shell window to modify the image:
# chroot /var/lib/systemimager/images/i386image chkconfig avahi-daemon off
- Define a first node for testing. Remember to change the node prefix from oscarnode to node, since the string 'oscarnode' is too long.
- Click 'setup networking', start collecting the MAC address and assign IPs, click 'Stop collecting MACs', click 'Configure DHCP Server' then click 'Setup Network Boot', wait for 'okay' popup.
- Open a new shell window to modify /tftpboot/kernel and /tftpboot/initrd.img. These two files are mistakenly linked to the x86_64 versions because the image architecture was labelled x86_64, which is apparently not what we wanted.
# rm /tftpboot/kernel /tftpboot/initrd.img
# cd /tftpboot
# cp -p /usr/share/systemimager/boot/i386/standard/kernel install-kernel-i386
# cp -p /usr/share/systemimager/boot/i386/standard/initrd.img install-initrd-i386.img
# ln -s install-kernel-i386 kernel
# ln -s install-initrd-i386.img initrd.img
- Disable floppy, Hyper-Threading, the keyboard-missing warning, etc.
- Make sure the node is set to PXE boot as the first priority in the boot list.
- Boot and install (I presume it went smoothly.)
- After the reboot, make sure the network interface is correct. Log in and ping 10.0.0.254 to see if the network is working. More than 10% of my nodes had eth0 and eth1 in the wrong order, and I could not alias them to the correct order; damn old hardware.
- In my case, I fired up netbootmgr and set the network-failure nodes to "install"
- Reboot the failed nodes and check the BIOS settings; make sure the RAM setting is at "default". All my failed nodes had wrong RAM specification settings.
- Another reason a node didn't install is that the network was too busy; in that case, a simple reboot will do.
- After the first node got installed successfully, click the 'Complete Cluster Setup' button.
- Don't forget to click the test cluster button.
- Now install 80 more nodes.
- post installation modification:
perl -pi -e 's/tmpwatch -x \/tmp/tmpwatch -m -x \/tmp/;s/10d/100d/' /var/lib/systemimager/images/i386image/etc/cron.daily/tmpwatch
- post installation modification:
# cpush /var/lib/systemimager/images/i386image/etc/cron.daily/tmpwatch /etc/cron.daily/tmpwatch
- post installation modification: append this at the end of the file /var/lib/systemimager/images/i386image/etc/sysctl.conf :
# for g03 (added by mengjuei hsieh)
kernel.randomize_va_space = 0
- post installation modification:
# env -u DISPLAY cexec sysctl -w kernel.randomize_va_space=0
- post installation modification: append this at the end of the file /var/lib/systemimager/images/i386image/etc/sysctl.conf :
# suggestion from ibm redbook (added by mjhsieh)
vm.overcommit_memory = 1
- post installation modification:
# env -u DISPLAY cexec 'sysctl -w vm.overcommit_memory=1'
- post installation modification:
# cpush /var/lib/systemimager/images/i386image/etc/sysctl.conf /etc/sysctl.conf
- post installation modification:
# env -u DISPLAY cexec :1-81 cp -pr /home/software/compilers-i386 /opt/compilers
- post installation modification, this is for compilers:
# yume --installroot /var/lib/systemimager/images/i386image install compat-libstdc++-33
# cexec yume -y install compat-libstdc++-33
# echo "compat-libstdc++-33" >> /opt/oscar/oscarsamples/fc-8-i386.rpmlist
- For some reason, pbs_mom might also be running on the cluster head node even though I disabled it at the beginning; remember to check it.
- For some reason torque-docs was not installed; fix this by doing
# yume install torque-docs
Make sure you didn't install FC8's torque-docs instead.
- I have 79 Pentium 4 nodes and 2 nodes with old Xeons (ia32), so I decided to create a PBS queue for the Pentium 4 nodes so that I can allocate them specifically. Use the command
# qmgr < P4.setting
where the P4.setting file contains:
#
# Create queues and set their attributes.
#
# Create and define queue p4
#
create queue p4
set queue p4 queue_type = Execution
set queue p4 resources_max.cput = 10000:00:00
set queue p4 resources_max.ncpus = 4
set queue p4 resources_max.nodect = 2
set queue p4 resources_max.walltime = 10000:00:00
set queue p4 resources_min.cput = 00:00:01
set queue p4 resources_min.ncpus = 1
set queue p4 resources_min.nodect = 1
set queue p4 resources_min.walltime = 00:00:01
set queue p4 resources_default.cput = 10000:00:00
set queue p4 resources_default.ncpus = 1
set queue p4 resources_default.nodect = 1
set queue p4 resources_default.walltime = 10000:00:00
set queue p4 resources_available.nodect = 2
set queue p4 enabled = True
set queue p4 started = True
#
# Assign nodes to queue p4
#
set node node1.local,node2.local,node3.local,node4.local,node5.local,node6.local,node7.local,node8.local,node9.local,node10.local,node11.local,node12.local,node13.local,node14.local,node15.local,node16.local,node17.local,node18.local,node19.local,node20.local,node21.local,node22.local,node23.local,node24.local,node25.local,node26.local,node27.local,node28.local,node29.local,node30.local,node31.local,node32.local,node33.local,node34.local,node35.local,node36.local,node37.local,node38.local,node39.local,node40.local,node41.local,node42.local,node43.local,node44.local,node45.local,node46.local,node47.local,node48.local,node49.local,node50.local,node51.local,node52.local,node53.local,node54.local,node55.local,node56.local,node57.local,node58.local,node59.local,node60.local,node61.local,node62.local,node63.local,node64.local,node65.local,node66.local,node67.local,node68.local,node69.local,node70.local,node71.local,node72.local,node73.local,node74.local,node75.local,node76.local,node77.local,node78.local,node79.local properties+=p4
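That 79-name node list is tedious to type by hand. Assuming GNU seq and paste are available on the head node, it can be generated:

```shell
# Build the comma-separated node list for the qmgr "set node" line
nodes=$(seq -f "node%g.local" 1 79 | paste -sd, -)
echo "set node $nodes properties+=p4"
```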
# curl http://mjhsieh.googlecode.com/svn/trunk/OpenPBS/free-nodes -o /usr/local/bin/free-nodes; chmod +x /usr/local/bin/free-nodes
# curl http://mjhsieh.googlecode.com/svn/trunk/OpenPBS/qterm -o /usr/local/bin/qterm; chmod +x /usr/local/bin/qterm
Here is the final result:
.------------------------------------------------------------.
| * OpenPBS NODES REPORT (v0.052) * (by mjhsieh) |
`------------------------------------------------------------'
Queue Free CPU Nodes Nodes Down
Name Nodes in use Defined or Offline
---------- ----- ----- ----- -----
p4 79 0 79 0
--------------------------------------------------
Summary: 81 0 81 0
There are 0 job(s) queued.
Further commands
- Example of updating the node image:
# yume --installroot /var/lib/systemimager/images/i386image update
- No, the previous command cannot be replaced by
# chroot /var/lib/systemimager/images/i386image
plus some other commands.
- Example of installing/updating a package on the nodes:
# cexec yume install -y vim-enhanced
Tuesday, May 26, 2009
g03 oops
# for g03
kernel.randomize_va_space = 0
(skipped; the commands are the easy part, the hard part was to recover this from my own rusty brain.)
The profile file: beware that if /usr/tmp is not working and gives you errors like
"Erroneous write during file extend. write -1 instead of 4096", check the write permission on $GAUSS_SCRDIR, or just set GAUSS_SCRDIR to $HOME.
export PATH="/home/software/g03/g03:${PATH}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/home/software/g03/g03:/home/software/g03/gv/lib"
export g03root="/home/software/g03/g03"
export GAUSS_EXEDIR="/home/software/g03/g03"
export GAUSS_SCRDIR="/usr/tmp"
export G03BASIS="/home/software/g03/g03/basis/"
export GAUSS_ARCHDIR="/home/software/g03/g03/arch"
export GV_DIR="/home/software/g03/gv"
alias gv=$GV_DIR'/gview'
alias gview=$GV_DIR'/gview'
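Given the $GAUSS_SCRDIR warning above, a guard like this falls back to $HOME when /usr/tmp is unusable (pick_scrdir is a made-up name, not part of the original profile):

```shell
# pick_scrdir DIR -- echo DIR if it is a writable directory, else $HOME
pick_scrdir() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "$1"
    else
        echo "$HOME"
    fi
}

export GAUSS_SCRDIR="$(pick_scrdir /usr/tmp)"
```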
My installation on the cluster:
$ find /home/software/g03 -type d
/home/software/g03
/home/software/g03/g03
/home/software/g03/g03/arch
/home/software/g03/g03/basis
/home/software/g03/g03/bsd
/home/software/g03/g03/tests
/home/software/g03/g03/tests/com
/home/software/g03/g03/tests/i386
/home/software/g03/g03/tests/rs6k
/home/software/g03/gv
/home/software/g03/gv/bin
/home/software/g03/gv/data
/home/software/g03/gv/data/biofrags
/home/software/g03/gv/data/biofrags/fragments
/home/software/g03/gv/data/elements
/home/software/g03/gv/data/elements/bitmaps
/home/software/g03/gv/data/elements/fragments
/home/software/g03/gv/data/fonts
/home/software/g03/gv/data/rgroups
/home/software/g03/gv/data/rgroups/bitmaps
/home/software/g03/gv/data/rgroups/fragments
/home/software/g03/gv/data/rings
/home/software/g03/gv/data/rings/bitmaps
/home/software/g03/gv/data/rings/fragments
/home/software/g03/gv/help
/home/software/g03/gv/help/icons
/home/software/g03/gv/help/pix
/home/software/g03/gv/help/refs
/home/software/g03/gv/lib
Thursday, October 16, 2008
Are you too busy to receive submissions?
areyoubusy(){
njobs=$(pbsnodes -x "$1" | grep '<jobs>' | sed -e 's/.*<jobs>//;s/<\/jobs>.*$//' \
| awk -F ',' 'END{print NF}')
if [ "${njobs:-0}" -gt 0 ]; then
echo "yes"
else
echo "no"
fi
}
# usage: areyoubusy node1.local
Monday, September 29, 2008
Some testing result on pmemd scaling and parallel computation
Message-ID: <6003f3300809291827p13fcdeew94133ee69f206f1b@mail.gmail.com>
Date: Mon, 29 Sep 2008 18:27:37 -0700
From: "Mengjuei Hsieh"
To: developers
Subject: Some testing result on pmemd scaling and parallel computation
Here is a recap of the JAC benchmark performance under the different parallel options I tried this weekend. We were exploring network connections with jumbo-frame (also known as large MTU, mtu=9000 in linux) gigabit ethernet to see if we could replace our previous parallel computing solution of connecting two machines directly with an ethernet cable (we called it sub-pairs, to reflect the fact that the machines are grouped in pairs). The reason is obvious: grouping computing nodes in pairs is not an efficient way to work with or manage the nodes. We used the NetPipe benchmark to measure the performance of gigabit ethernet with and without jumbo frames; the results are consistent with general wisdom and references on the internet and in the literature. I thought we could utilize more bandwidth with jumbo-frame ethernet.
First, I tested the scaling of amber 9 pmemd with lam/mpi or mpich on jumbo-frame ethernet. The testing environment: two identical Dell PowerEdge 1950, each with 2 Intel Xeon 5140 Woodcrest dual-core processors, 4MB cache, 2GB RAM; shared memory interconnect / MPICH-1.2.6 / LAM-MPI 7.1.4; Intel Fortran 90 compiler, Intel MKL. The parallel performance results are:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms
#procs  nsec/day  scaling, %
   1     0.329      --
   2     0.628      95  (SMP)
   4     1.094      83  (SMP)
   4     0.965      73  (TCP, 1+1+1+1)
   4     0.819      62  (SMP/TCP, 2+2)
   8     0.987      37  (SMP/TCP, 4+4)
This does not meet the definition of "scaling", so network traffic was also measured; in the network-communication cases, only 30% of the bandwidth was used.
As a side note, these runs used at least P4_SOCKBUFSIZE=131072 (mpich) and net.core.rmem_max=131072, net.core.wmem_max=131072; similar results were observed under lam-mpi with rpi_tcp_short=131072. Further tests on directly connected pairs show similar measurements. Therefore the benchmark fell back to amber 8 pmemd, the original program we had in the sub-pair configuration. The parallel performance with amber 8 pmemd:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms
#procs  nsec/day  scaling, %
   1     0.203      --
   2     0.391      96  (SMP)
   4     0.465      57  (SMP)
   4     0.457      56  (SMP/TCP, 2+2)
   8     0.680      42  (SMP/TCP, 4+4)
The less efficient amber 8 pmemd makes the scaling factor of the 4+4-cpu parallel computation look better, but the performance is definitely not better. Similar results were observed on directly connected pairs. Interest then turned to the scaling of AMBER 10 pmemd:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms
#procs  nsec/day  scaling, %
   1     0.411      --
   4     1.329      80  (SMP)
   8     1.137      35  (SMP/TCP, 4+4)
At this point, all I can say is: don't expect anything too interesting from gigabit ethernet performance. This conclusion is consistent with observations from Dr. Duke and Dr. Walker. A further benchmark was done for Amber 10 pmemd on a dual quad-core Intel Xeon E5410 machine (Dell PE1950, 2.3GHz, 6MB cache, 2GB RAM):
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms (on the same machine, SMP mode)
#procs  nsec/day  scaling, %
   1     0.434      --
   2     0.815      94
   4     1.464      84
   6     1.964      75
   8     2.274      65
That's all. AMBER 10 pmemd rocks.
Bests,
--
Mengjuei
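The scaling column in these tables is throughput relative to N times the single-process run. A one-liner to reproduce it (the helper name is mine, not from the email):

```shell
# scaling NPROCS PERF PERF_1 -- percent of ideal N-way speedup
scaling() {
    awk -v n="$1" -v p="$2" -v p1="$3" 'BEGIN { printf "%.0f\n", 100 * p / (n * p1) }'
}

scaling 2 0.628 0.329   # the amber 9 two-process SMP row -> 95
```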
Monday, December 31, 2007
Thursday, November 23, 2006
OSCAR5: failed to recognize the SAS interface before mkfs
The fix was to add "modprobe mptsas" in the /var/lib/systemimager/scripts/pre-install file. However, it's also okay to hack the master script. Later I will post my way of assigning an internal IP to the extra ethernet port.
Saturday, November 18, 2006
Simple NAT rules (tested on oscar 5)
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -o eth1 -j MASQUERADE
COMMIT
#
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A FORWARD -d 192.168.0.0/255.255.255.0 -i eth0 -j ACCEPT
-A FORWARD -s 192.168.0.0/255.255.255.0 -i eth0 -j ACCEPT
COMMIT
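One thing the rules above quietly assume: MASQUERADE only takes effect when IPv4 forwarding is enabled on the gateway. A config fragment for /etc/sysctl.conf:

```
net.ipv4.ip_forward = 1
```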
Monday, July 31, 2006
Backing-up OSCAR 5 Data
tar cf /backup/oscar.tar /opt/kernel_picker /etc/systemimager /etc/dhcpd.conf /etc/profile.d/00-modules.* /etc/profile.d/c3.* /etc/profile.d/oscar_home.* /etc/profile.d/ssh-oscar.* /etc/systeminstaller /opt/env-switcher* /opt/lam* /opt/maui /opt/modules /opt/mpich* /opt/pbs /opt/perl-Qt /opt/sync_files /tftpboot/initrd.img /tftpboot/kernel /tftpboot/pxelinux.* /usr/lib/systemimager /usr/lib/systeminstaller /var/lib/ganglia /var/lib/mysql /var/lib/oscar /var/lib/systemimager /var/log/systemimager /var/spool/pbs /opt/oscar /etc/httpd/conf.d/ganglia.conf /etc/ssh
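A backup you never list is a backup you only hope exists. A quick readability check on the archive (verify_tar is a hypothetical name):

```shell
# verify_tar ARCHIVE -- "ok" if tar can walk the whole archive, else "corrupt"
verify_tar() {
    if tar tf "$1" > /dev/null 2>&1; then
        echo "ok"
    else
        echo "corrupt"
    fi
}

verify_tar /backup/oscar.tar
```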
Backing-up OSCAR 3 Nodes by Hand
rsync -avxHS --numeric-ids --delete --exclude 'lost+found' --exclude '/tmp/*' --exclude '/var/spool/clientmqueue/*' --exclude '/var/spool/pbs/aux/*' --exclude '/var/spool/pbs/mom_logs/*' --exclude '/var/spool/pbs/mom_priv/jobs/*' --exclude '/var/spool/pbs/spool/*' --exclude '/var/spool/pbs/undelivered/*' --exclude '*.pid' --exclude '*.lock' --exclude 'swapfile0' / 10.0.0.254:/usr/cluster/nodes/oscar3/root/
and also this:
rsync -avxHS --numeric-ids --delete --exclude 'lost+found' --delete-excluded /boot/ 10.0.0.254:/usr/cluster/nodes/oscar3/boot/
Done. (Update: please be careful about the directories that have been excluded... I don't trust the exclude expressions.)
Reimaging OSCAR 3 Nodes by Hand
Hard drive failures and irreparable filesystems are pretty common during the operation of a cluster, and the first thought is always to re-image the node from the image stored on the server. However, in OSCAR (at least in v3 as well as v4) you need to flush the OpenPBS settings or restart the PBS server, which causes a lot of problems if you have a lot of jobs running on the rest of the computing nodes. Here are my notes on reimaging OSCAR 3 nodes manually:
- PXE boot into the FC2 installation until it loads the GUI (stage 2)
- Alt-F2 to enter the shell
- To erase the previous partition, do:
parted -s -- /dev/sda mklabel msdos
parted -s -- /dev/sda print
- Then build the partition table:
parted -s -- /dev/sda mkpart primary ext2 0 24
parted -s -- /dev/sda mkpart extended 24 76319
parted -s -- /dev/sda mkpart logical 24 2072
parted -s -- /dev/sda mkpart logical ext2 2072 76319
- Format the partition:
mke2fs /dev/sda6
mkswap /dev/sda5
mke2fs /dev/sda1
mkdir -p /a/boot
mount /dev/sda6 /a/
mount /dev/sda1 /a/boot/
- The FC2 boot disk doesn't have rsync, so retrieve it from the server:
scp 10.0.0.254:/usr1/cluster/nodes/rayl3/root/usr/bin/rsync /usr/bin/
- Copy:
rsync -avxHS --numeric-ids --exclude 'lost+found' \
  10.0.0.254:/usr/cluster/nodes/oscar3/root/ /a/
rsync -avxHS --numeric-ids --exclude 'lost+found' \
  10.0.0.254:/usr/cluster/nodes/oscar3/boot/ /a/boot/
- Create a swapfile, since we have extra swap requirements for gaussian and didn't enlarge the swap partition in the first place:
dd if=/dev/zero of=/a/var/vm/swapfile0 bs=1M count=1024
mkswap /a/var/vm/swapfile0
- Modify the files /etc/sysconfig/network, /etc/sysconfig/network-scripts/ifcfg-eth0, /etc/pfilter.cmds, /etc/pfilter.src for ip address
- Restoring GRUB
chroot /a/
grub
root (hd0,0)
setup (hd0)
quit
exit
- Reboot, Change the BIOS boot sequence if necessary.
Sunday, July 23, 2006
OSCAR 4 with direct links
Direct Links
For computing nodes that have more than one GbE port, it might be a good idea to make a direct connection between two nodes, taking advantage of the extra network speed for parallel computation. What I did was set up IP addresses for the direct-link ports, for instance, 192.168.0.1 for odd-numbered nodes and 192.168.0.2 for even-numbered nodes.
PBS/Torque
In order to access the specified resources through OpenPBS/Torque, we need to create a customized queue for the paired nodes, since it's a peer-to-peer direct link. What I did was use qmgr and import these commands:
# I copied the resources settings from the workq of OSCAR 4,
# it might be different from the default workq of OSCAR 5
create queue subpair01
set queue subpair01 queue_type = Execution
set queue subpair01 resources_max.cput = 10000:00:00
set queue subpair01 resources_max.ncpus = 8
set queue subpair01 resources_max.nodect = 2
set queue subpair01 resources_max.walltime = 10000:00:00
set queue subpair01 resources_min.cput = 00:00:01
set queue subpair01 resources_min.ncpus = 1
set queue subpair01 resources_min.nodect = 1
set queue subpair01 resources_min.walltime = 00:00:01
set queue subpair01 resources_default.cput = 10000:00:00
set queue subpair01 resources_default.ncpus = 1
set queue subpair01 resources_default.nodect = 1
set queue subpair01 resources_default.walltime = 10000:00:00
set queue subpair01 resources_available.nodect = 2
set queue subpair01 enabled = True
set queue subpair01 started = True
set node node1.local,node2.local properties+=subpair01
Actually, you can save the commands into a file and use
qmgr < ./commands
By the way, before going into parallel computation over the direct link, make sure your ssh won't croak on the host-key signatures. Use c3 to get that done (like:
cexec :1-2 ssh 192.168.0.1 uptime; cexec :1-2 ssh 192.168.0.2 uptime
). (Actually, there are a lot of potential problems with ssh; I believe the signature problems are simplified by the OSCAR installation.)
MPICH
Here I use a PBS script to submit my MPICH jobs; this example is for the AMBER jac benchmark. Please read the script to see how I specify MPI_HOST to tell MPICH the routing of message traffic.
#!/bin/sh
#PBS -N "MPICHjob"
#PBS -q subpair01
#PBS -l nodes=2:subpair01:ppn=8
#PBS -S /bin/sh
#PBS -r n
cd /home/demo/MPICH_SUBPAIR
# customized machinefile
cat > machine.subpairN << EOF
192.168.0.1:4
192.168.0.2:4
EOF
# Tell mpich to run through the direct link
export MPI_HOST=`/sbin/ifconfig eth1 | grep "inet addr:" \
| sed -e 's/inet addr://' | awk '{print $1}'`
# Recommended by Dave Case in the Amber mail list
export P4_SOCKBUFSIZE=524288
# Run
source /opt/intel/fc/9.0/bin/ifortvars.sh
/home/software/mpich_net/bin/mpirun -machinefile ./machine.subpairN -np 8 \
/home/software/amber9/exe/pmemd.MPICH_NET -O -i mdin.amber9 -c \
inpcrd.equil -p prmtop -o /tmp/output.txt -x /dev/null -r \
/dev/null
# Data Retrieval
mv /tmp/output.txt output.pmemd9.MPICH_SUBPAIR
LAM/MPI
These are the scripts for LAM/MPI; you can see I still need to specify the routing of traffic. Also, the first node defined by lamboot may not be the same node that PBS sends you to.
#!/bin/sh
#PBS -N "LAMMPIjob"
#PBS -q subpair01
#PBS -l nodes=2:subpair01:ppn=8
#PBS -S /bin/sh
#PBS -r n
cd /home/demo/LAM
# customized machinefile
cat > machine.subpairN << EOF
192.168.0.1 cpu=4
192.168.0.2 cpu=4
EOF
# if we don't specify -ssi boot rsh, lam will use boot tm and
# the IPs provided by pbs that uses the oscar lan.
/opt/lam/bin/lamboot -ssi boot rsh -ssi rsh_agent "ssh" -v machine.subpairN
# Run
source /opt/intel/fc/9.0/bin/ifortvars.sh
/opt/lam/bin/mpirun -ssi rpi sysv -np 8 \
./sander9.LAM -O -i mdin -c inpcrd.equil -p prmtop \
-o /tmp/output.txt -x /tmp/trajectory.crd -r /tmp/restart.rst
/opt/lam/bin/lamhalt >& /dev/null
# Data Retrieval
# because the master node is n0, not the first node of pbs
ssh 192.168.0.1 mv /tmp/output.txt /tmp/trajectory.crd /tmp/restart.rst /home/demo/LAM
Friday, July 21, 2006
OSCAR/branches/branch-5-0
One thing worth noticing is that branch 5.0 is still changing. In order to keep my installation updated, I could use
svn diff -r revision | less
to check the changes and upgrade the OSCAR installation to the current stable branch. Practically, it is much easier to just download the nightly build, do a diff, and update the packages respectively (by hand, of course).
Monday, February 20, 2006
OSCAR 4 lam-mpi and ifort
This is a note on ifort with lam-mpi under an OSCAR cluster. You can either compile lam by yourself:
% env CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort ./configure \
  --prefix=/home/lammpi
Or use "ifort -assume 2underscores" as your FC for your applications.
By the way, the run-time ssi option sysv is good for general cases, since communication goes through shared memory on the same machine and through tcp between different machines.
Also check this out:
#!/bin/sh
#PBS -N "LAMMPIjob"
#PBS -q workq
#PBS -l nodes=1:ppn=2
#PBS -S /bin/sh
# To check if this is forked by qsub
# Not necessary
if [ -z "$PBS_ENVIRONMENT" -a "$SSH_TTY" ]
then
# go to some directory
cd /home/demo/LAM
else
# go to some directory
cd $PBS_O_WORKDIR
fi
# using the lazy default tm module (pbs)
/opt/lam-7.0.6/bin/lamboot
/opt/lam-7.0.6/bin/mpirun -ssi rpi sysv -np 2 \
./sander.LAM -O -i mdin -c inpcrd.equil -p prmtop \
-o /tmp/output.txt -x /tmp/trajectory.crd -r /tmp/restart.rst
/opt/lam-7.0.6/bin/lamhalt
# Data Retrieval
mv /tmp/output.txt /tmp/trajectory.crd /tmp/restart.rst .
Monday, November 21, 2005
Epilogue Script in OpenPBS
This script may not be able to correctly clean the orphan processes you want to remove. I recommend giving LAM/MPI a try instead of MPICH.
To clean up the processes left behind after jobs exit the nodes, an epilogue script is a convenient choice. Here is an example (although not compatible with all scenarios) for Torque in the OSCAR 4.x package:
#!/bin/sh
# Please notice that ALL processes from $USER will be killed (!!!)
echo '--------------------------------------'
echo Running PBS epilogue script
# Set key variables
USER=$2
NODEFILE=/var/spool/pbs/aux/$1
# PPN here is really the number of distinct nodes in the nodefile
PPN=`/bin/sort $NODEFILE | /usr/bin/uniq | /usr/bin/wc -l`
if [ "$PPN" = "1" ]; then
# only one node used
echo Done.
#su $USER -c "skill -v -9 -u $USER"
else
# more than one node used
echo Killing processes of user $USER on the batch nodes
for node in `cat $NODEFILE`
do
echo Doing node $node
su $USER -c "ssh -a -k -n -x $node skill -v -9 -u $USER"
done
echo Done.
fi
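The node-count computation in the script is worth isolating, since it is what decides whether the kill loop runs at all; a standalone sketch (count_nodes is my name for it):

```shell
# count_nodes NODEFILE -- number of distinct hosts listed in a PBS nodefile
count_nodes() {
    sort "$1" | uniq | wc -l
}
```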
Monday, September 19, 2005
AMBER 8 PMEMD on Hyper-Threading Machines
The configuration in this benchmark is shown here:
- OSCAR 4.2 (pre-beta version) with Fedora Core 3 Linux
- Intel Xeon 2.8G with 1MB cache
- Direct GbE connection between two machines
- LAMMPI/MPICH version of PMEMD from AMBER 8 distribution
- According to the document provided by Dr. Duke, P4_SOCKBUFSIZE is set to 524288 for MPICH (and /etc/sysctl.conf on the nodes has to be changed accordingly.).
Figure 1. The JAC benchmark on different node/processor combinations. "Spreading nodes first" means distributing the threads across as many nodes as possible. In this plot we can see that the performance of 8 threads (4 threads on each node) without hyper-threading is actually much worse than 4 threads (2 threads on each node). It also implies that hyper-threading is good for stress testing.
Figure 2. This plot shows that if we populate one node fully before populating the other, we see linear scaling (when ignoring the 2-thread calculation).
The scaling of the JAC calculation seems fine, probably because the footprint of PMEMD and the simulated system is small. Perhaps a benchmark on a bigger system is needed (the JAC simulation is DHFR, a protein with 159 amino acid residues, in explicit water). I also tried the Jumbo Frames setting (GbE with MTU > 1500); it slows down the LAM calculation, as other reports predicted.
Conclusion:
- Hyper-threading does help.
- Scaling on a hyper-threading machine can be linear, depending on how you look at it.
References:
- Dr. Bob Duke, Using Intel compilers (ifc8) with PMEMD
- Joint Amber/Charmm DHFR benchmark, the information can be found at AMBER benchmark website.
- Gelb Research Group at Washington University in St. Louis, "Fjord" - a linux cluster
Thursday, September 01, 2005
Preventing Users From Logging Into Computing Nodes Without PBS
Normally an experienced beowulf cluster administrator would suggest that people not log into the computing nodes directly. However, we (I am a user myself, too) tend to connect to the computing nodes and run things on them without going through the scheduler (or resource allocator, we might say). All right, it's just a small job, and you don't want to run it on the server because the server is often busy. The administrator would probably be mad, because he or she cannot keep the resources fairly accessible to all users. Therefore, the following command (only for OSCAR clusters or other clusters with PBS/Torque installed) is, I guess, what administrators should recommend their users run:
$ qsub -I -N "interactivejob" -S /bin/tcsh -q workq -l nodes=1:ppn=1
This lets users log in to the compute nodes through the scheduler.
Do we really think the users will follow the rules and give up logging in to the nodes directly? No, they are not stupid. A civilized but lazy approach is to beg the users in /etc/motd:
Please! Please do not ssh into the node! We beg you!
Of course this doesn't work on hackers, and unfortunately people usually think of themselves as hackers. So if you put the following local.csh script in /etc/profile.d/ on the nodes, you can stop manual logins through ssh:
if ( ! $?PBS_ENVIRONMENT ) then
if ( $?SSH_TTY && `whoami` != "root" ) then
echo; echo please stop login the node thru ssh; echo
logout
endif
endif
Or, as Jenna (in #oscar-cluster @ FreeNode) pointed out, use local.sh for bash/sh users:
[ -z "$PBS_ENVIRONMENT" -a "$SSH_TTY" -a `whoami` != "root" ] && logout
This design will not interfere with cexec, MPI, qsub or pbsdsh. It does not absolutely guarantee that users cannot ssh to the nodes, though; if users are determined to do so, the admin should rely on more civilized communication skills rather than escalating into a technical fight.
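The sh one-liner above packs three tests into one bracket. A more readable POSIX-sh sketch of the same check follows; the function name deny_ssh_login and the explicit user argument are my own additions for illustration:

```shell
# return success (0) when the session should be kicked out:
# not inside a PBS job, connected over an ssh tty, and not root
deny_ssh_login() {
    user=${1:-$(id -un)}    # allow overriding the user for testing
    [ -z "$PBS_ENVIRONMENT" ] && [ -n "$SSH_TTY" ] && [ "$user" != root ]
}

# in /etc/profile.d/local.sh you would then do:
# deny_ssh_login && logout
```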
With this problem out of the way, we now face another one: people just do their stuff on the server, because they can't log in to the nodes. And qsub is such a hassle that no genius will use it. Screw you guys, I am going home. 凸
Wednesday, August 24, 2005
Booting Nodes with Memtest86/Memtest86+ Under OSCAR 4
OSCAR 5 will bundle netbootmgr, which exploits the beauty of PXE boot and manages the boot options from the server. Great feature!
This is the way to add memtest86/memtest86+ by directly modifying (or hacking, I should say) the systemimager image. The better way, of course, would be to find the kernel switcher/systeminstaller script in the OSCAR 4 packages and modify that, if I ever have the time. My kludge is to add kernel entries in /var/lib/systemimager/images/IDEimage/etc/systemconfig/systemconfig.conf (assuming the name of your OSCAR image is IDEimage). You also need to copy the memtest86 boot images into the /var/lib/systemimager/images/IDEimage/boot directory.
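Concretely, the copy step might look like this (a sketch; the file names match the memtest versions used below, and IDEimage is the example image name):

```shell
# copy the memtest boot images into the systemimager image tree
cp memtest86+-1.60.bin.gz memtest86-3.2.bin.gz \
   /var/lib/systemimager/images/IDEimage/boot/
```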
So my systemconfig.conf looks like this; the sections I added are [KERNEL2] and [KERNEL3]:
# systemconfig.conf written by systeminstaller.
CONFIGBOOT = YES
CONFIGRD = YES
[BOOT]
ROOTDEV = /dev/hda6
BOOTDEV = /dev/hda
DEFAULTBOOT = 2.6.12-1.1372_F
[KERNEL0]
PATH = /boot/vmlinuz-2.6.12-1.1372_FC3smp
LABEL = 2.6.12-1.1372_F
[KERNEL1]
PATH = /boot/vmlinuz-2.6.12-1.1372_FC3
LABEL = 2.6.12-1.1372_F
[KERNEL2]
PATH = /boot/memtest86+-1.60.bin.gz
LABEL = memtest86+
[KERNEL3]
PATH = /boot/memtest86-3.2.bin.gz
LABEL = memtest86
You can find memtest86 at http://www.memtest86.com/ and memtest86+ at http://www.memtest.org/. Well, if you are really bored, you can also replace the grub splash.xpm.gz (the background image of the grub boot menu). Just make sure it is a 640x480 xpm with 14 indexed colors, and translate any color names (for example gray25) to hex RGB (#404040). Then replace the original /var/lib/systemimager/images/IDEimage/boot/grub/splash.xpm.gz with your masterpiece.
Tuesday, August 23, 2005
Ethernet Bridging Problems with OSCAR 4.2β
For some reason, I am using a special but not rare configuration for our OSCAR clusters like this:
The reasons I designed it that way were quite simple.
- I want larger bandwidth for each computing node.
- Because I can.
Let me write down the configuration for anyone who might find it interesting. First of all, my Linux server (head node) has 6 GbE network ports, and the first one is already used for the Internet connection. I just needed a way to bind the other 5 ethernet ports to the same IP and share the bandwidth load. In Fedora Core 2/3 Linux this is very easy to do: set each member of the ethernet bridge to ONBOOT=yes, IPADDR=0.0.0.0 and BRIDGE=br0 (br0 is a virtual device for the bridge). Meanwhile, the ifcfg-br0 settings would look like this:
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
IPADDR=10.0.0.254
NETMASK=255.255.255.0
ONBOOT=yes
DELAY=0
STP=on
For more information, check out the FAQ of the ethernet bridge project. Now, presuming the readers of this nonsense blog entry have already installed OSCAR in /opt/oscar but have not yet started "./install_cluster br0", you can modify the script /opt/oscar/packages/pfilter/scripts/post_clients like this (as a patch file):
--- post_clients.orig	2005-08-23 20:35:18.000000000 -0700
+++ post_clients	2005-08-23 20:34:44.000000000 -0700
@@ -176,7 +176,7 @@
 # the server and every compute node trust each other
-trusted %oscar_server% %nodes%
+trusted %oscar_server% %nodes% $on_interface
 
 open multicast # for ganglia
Or, if you already ran "./install_cluster br0", just modify /etc/pfilter.conf, add br0 to the "trusted %oscar_server% %nodes%" line, and then issue "service pfilter restart". That's it: your compute nodes can now connect through the bridge interface.
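For completeness, each bridge member port mentioned above gets its own config file. A sketch for eth1, to be repeated for eth2 through eth5 (BOOTPROTO=none is my assumption, consistent with a bridged port carrying no IP of its own):

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
BOOTPROTO=none
ONBOOT=yes
IPADDR=0.0.0.0
BRIDGE=br0
```

After restarting the network service, `brctl show` (from bridge-utils) should list br0 with eth1 through eth5 as its member ports.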