Showing posts with label Cluster.

Friday, December 17, 2010

amber11 pmemd + LAM-7.1.4/torque

The default optimization flag for pmemd.MPI is -fast, which causes some trouble on our cluster because the Torque library doesn't get along with -static at all. You might already know that -fast is equivalent to "-xHOST -O3 -ipo -no-prec-div -static". My suggestion is to use "-axSTP -O3 -ipo -no-prec-div" instead. The reason is compatibility: -xHost isn't a good optimization flag either, because the processors in our cluster are not all identical, so -xHost just adds one more way for things to break.
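
A minimal sketch of swapping the flags after running Amber's configure; the file name and location are assumptions (check where your configure actually writes the compiler flags):

```shell
# hypothetical sketch: swap -fast for explicit, portable flags
# FLAGS_FILE is an assumption; adjust to wherever configure wrote the flags
FLAGS_FILE="${AMBERHOME:-/opt/amber11}/src/config.h"
sed -i 's/-fast/-axSTP -O3 -ipo -no-prec-div/g' "$FLAGS_FILE"
grep -n 'axSTP' "$FLAGS_FILE"   # sanity check the substitution
```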

Wednesday, October 14, 2009

E1618 on PowerEdge 2950

From the manual:
E1618 PS # Predictive: Power supply voltage is out of acceptable range; specified power supply is improperly installed or faulty.

Monday, June 15, 2009

Fedora 8 x86_64 server, OSCAR 5.x, ia32 nodes oh my!

Please leave a comment below if anyone is reading this.

First of all, my server is a Dell PE2950 with 8G memory and 6 network ports. The first time I tried to install OSCAR, I used CentOS 5.3 x86_64 with OSCAR 6.0.3-1. However, OSCAR 6.0.3-1 is not stable enough for me to install things without errors. No luck for Fedora 9 with OSCAR 6.0.3-1 either, thanks to many Perl package obstacles. When I fell back to OSCAR 5.1+, I first tried Fedora 9 under a false impression of compatibility, but unfortunately it isn't compatible. So here I come, Fedora 8 x86_64! My goal is to install a cluster with ia32 (i386) nodes on an x86_64 server, utilizing all 6 network ports.

Here are some notes:

  1. Since I installed Fedora 8 from disk+network without any modification (the only option I chose was the developer software set), I only have a 2G swap partition, so I need a swap file: # dd if=/dev/zero of=/swapfile0 bs=1M count=8192
  2. # mkswap /swapfile0; chmod 600 /swapfile0; swapon /swapfile0
  3. # echo "/swapfile0 swap swap defaults 0 0" >> /etc/fstab
  4. # chkconfig NetworkManager off (in fc8 after the text-mode installation, the default setting is off)
  5. # chkconfig network on (in fc8 after the text-mode installation, the default setting is on)
  6. # chkconfig iptables off
  7. # chkconfig ip6tables off
  8. Make sure you really turned off iptables and NetworkManager; the GUI tools might be deceptive.
  9. # perl -pi -e 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
  10. # yum -y install bridge-utils gnuplot grace
  11. # brctl addbr br0
  12. # brctl addif br0 eth1; brctl addif br0 eth2; brctl addif br0 eth3; brctl addif br0 eth4; brctl addif br0 eth5
  13. edit /etc/sysconfig/network-scripts/ifcfg-eth[1-5] accordingly with BOOTPROTO=none, ONBOOT=yes, BRIDGE=br0, NM_CONTROLLED=no, IPV6INIT=no, IPV6_AUTOCONF=no
  14. edit /etc/sysconfig/network-scripts/ifcfg-eth0 to set NM_CONTROLLED=no, IPV6INIT=no, IPV6_AUTOCONF=no
  15. create /etc/sysconfig/network-scripts/ifcfg-br0 with contents of
    DEVICE=br0
    TYPE=Bridge
    BOOTPROTO=static
    IPADDR=10.0.0.254
    NETMASK=255.255.255.0
    ONBOOT=yes
    NOZEROCONF=yes
    DELAY=0
    STP=no
    NM_CONTROLLED=no
    IPV6INIT=no
    IPV6_AUTOCONF=no
  16. create /etc/modprobe.d/disableipv6 with one line:
    install ipv6 /bin/true
  17. edit /etc/sysconfig/network to modify the hostname, and use the "hostname" command to change the current setting as well.
  18. Add oscar_server nfs_oscar pbs_oscar for 10.0.0.254 into the file /etc/hosts
  19. Add all the names for this server into the file /etc/mail/local-host-names
  20. Add
    ALL  : 10.0.0.0/255.255.255.0,localhost,the-external-ip-for-your-server
    sshd : 10.0.0.0/255.255.255.0,.uci.edu
    into the file /etc/hosts.allow . (I am from .uci.edu .)
  21. Add
    ALL    : ALL EXCEPT LOCAL
    into the file /etc/hosts.deny .
  22. If httpd was installed, add
    ServerAdmin yourname@yourmail.box
    ServerSignature Off
    ServerTokens Prod
    into a new file of /etc/httpd/conf.d/lab.conf
  23. # yum remove NetworkManager.i386 (important if you want to do a full update with yum.)
  24. # yum update (optional, but it should be very helpful.)
  25. # reboot
  26. I have doubts on OSCAR 6, so I chose to use OSCAR 5.x from the nightly branch, downloaded oscar-repo-common-rpms-5*nightly-*.tar.gz, oscar-repo-fc-8-x86_64-5*nightly-*.tar.gz and oscar-repo-fc-8-i386-5*nightly-*.tar.gz
  27. # mkdir -p /tftpboot/distro /tftpboot/oscar; tar xzfC oscar-repo-common-rpms-*.tar.gz /tftpboot/oscar/; tar xzfC oscar-repo-fc-8-x86_64-5*nightly-*.tar.gz /tftpboot/oscar/; tar xzfC oscar-repo-fc-8-i386-5*nightly-*.tar.gz /tftpboot/oscar/
  28. # perl -pi -e 's/gpgcheck=1/gpgcheck=0/' /etc/yum.conf
  29. # yum install createrepo /tftpboot/oscar/common-rpms/yume*.rpm
  30. # yume --repo /tftpboot/oscar/common-rpms install oscar-base
  31. # rsync -avx --delete --bwlimit=128 rsync://archive.fedoraproject.org/fedora-archive/fedora/linux/releases/8/Fedora/x86_64/os/Packages/ /tftpboot/distro/fedora-8-x86_64/
  32. # rsync -avx --delete --bwlimit=128 rsync://archive.fedoraproject.org/fedora-archive/fedora/linux/releases/8/Fedora/i386/os/Packages/ /tftpboot/distro/fedora-8-i386/
  33. create /tftpboot/distro/fedora-8-i386.url with one line:
    file:/tftpboot/distro/fedora-8-i386
    since this file won't be generated automatically. (Not sure whether fedora-8-x86_64.url was generated or bundled with the packages; check it anyway.)
  34. # cd /opt/oscar/lib
  35. # ../scripts/repo-update --url http://archive.fedoraproject.org/pub/archive/fedora/linux/updates/8/i386.newkey --repo /tftpboot/distro/fedora-8-i386
  36. # ../scripts/repo-update --rmdup --repo /tftpboot/distro/fedora-8-i386
  37. # yume --prepare --repo /tftpboot/distro/fedora-8-i386
  38. # ../scripts/repo-update --url http://archive.fedoraproject.org/pub/archive/fedora/linux/updates/8/x86_64.newkey --repo /tftpboot/distro/fedora-8-x86_64
  39. # ../scripts/repo-update --rmdup --repo /tftpboot/distro/fedora-8-x86_64
  40. # rm /tftpboot/distro/*/*torque* /tftpboot/distro/*/openmpi*rpm
  41. # yume --prepare --repo /tftpboot/distro/fedora-8-x86_64
  42. Make sure that /tftpboot/distro/fedora-8-x86_64.url, /tftpboot/distro/fedora-8-i386.url, /tftpboot/oscar/fc-8-i386.url and /tftpboot/oscar/fc-8-x86_64.url exist and contain correct URL information.
  43. # perl -pi -e 's/^#PermitRootLogin yes/PermitRootLogin yes/' /etc/ssh/sshd_config
  44. # yum install perl-AppConfig
  45. # perl -pi -e 's/\/usr\/sbin\/netbootmgr/\/usr\/bin\/netbootmgr/' /opt/oscar/scripts/oscar_wizard
  46. # cd /opt/oscar; env OSCAR_VERBOSE=3 ./install_cluster br0
  47. Uncheck loghost in the package list because it does not work in fc8.
  48. Normally you don't need to worry about the following configuration step; I didn't change anything, but Ganglia seems to be a good place to tweak the default settings.
  49. install server packages
  50. Remember that every time you run "./install_cluster", your previous settings made via install_cluster will be lost, and you have to redo the settings and package installation. To avoid that, once the "install server packages" step is done, if for some reason you exited ./install_cluster, you are free to use oscar_wizard instead: 1. open a new shell window other than your old ./install_cluster shell; 2. cd /opt/oscar/scripts; ./oscar_wizard
  51. revise /opt/oscar/oscarsamples/scsi.disk for SATA/SCSI nodes configuration.
  52. revise /opt/oscar/oscarsamples/fc-8-i386.rpmlist
  53. build an image for i386 (ia32) nodes
  54. Use another shell window to modify the image: # chroot /var/lib/systemimager/images/i386image chkconfig avahi-daemon off
  55. Define a first node for testing. Remember to change the name 'oscarnode' to 'node', since the string 'oscarnode' is too long.
  56. Click 'setup networking', start collecting the MAC address and assign IPs, click 'Stop collecting MACs', click 'Configure DHCP Server' then click 'Setup Network Boot', wait for 'okay' popup.
  57. Open a new shell window to modify /tftpboot/kernel and /tftpboot/initrd.img. These two files are mistakenly linked to the x86_64 versions because the image architecture was labelled x86_64, which is apparently not what we wanted.
    # rm /tftpboot/kernel /tftpboot/initrd.img
    # cd /tftpboot
    # cp -p /usr/share/systemimager/boot/i386/standard/kernel install-kernel-i386
    # cp -p /usr/share/systemimager/boot/i386/standard/initrd.img install-initrd-i386.img
    # ln -s install-kernel-i386 kernel
    # ln -s install-initrd-i386.img initrd.img
  58. Disable floppy, Hyperthreading, keyboard missing warning... etc.
  59. Make sure the node set to PXE boot as the first priority in the boot list.
  60. Boot and install (I presume it went smoothly.)
  61. After the reboot, make sure the network interface is correct. Log in and ping 10.0.0.254 to see if the network is working. On more than 10% of my nodes, eth0 and eth1 come up in the wrong order and I could not alias them back to the correct order; damn old hardware.
  62. In my case, I fired up netbootmgr and set the network failure node to "install"
  63. Reboot the failed nodes and check the BIOS settings; make sure the RAM setting is at "default". All my failed nodes had wrong RAM specification settings.
  64. Another reason a node fails to install is that the network is too busy; in that case a simple reboot will do.
  65. After first node got installed successfully, click the 'Complete Cluster Setup' button.
  66. Don't forget to click the test cluster button.
  67. Now install 80 more nodes.
  68. post installation modification:
    perl -pi -e 's/tmpwatch -x \/tmp/tmpwatch -m -x \/tmp/;s/10d/100d/' /var/lib/systemimager/images/i386image/etc/cron.daily/tmpwatch
  69. post installation modification: # cpush /var/lib/systemimager/images/i386image/etc/cron.daily/tmpwatch /etc/cron.daily/tmpwatch
  70. post installation modification: append this in the end of file /var/lib/systemimager/images/i386image/etc/sysctl.conf :
    # for g03 (added by mengjuei hsieh)
    kernel.randomize_va_space = 0
  71. post installation modification: # env -u DISPLAY cexec sysctl -w kernel.randomize_va_space=0
  72. post installation modification: append this in the end of file /var/lib/systemimager/images/i386image/etc/sysctl.conf :
    # suggestion from ibm redbook (added by mjhsieh)
    vm.overcommit_memory = 1
  73. post installation modification: # env -u DISPLAY cexec 'sysctl -w vm.overcommit_memory=1'
  74. post installation modification: # cpush /var/lib/systemimager/images/i386image/etc/sysctl.conf /etc/sysctl.conf
  75. post installation modification: # env -u DISPLAY cexec :1-81 cp -pr /home/software/compilers-i386 /opt/compilers
  76. post installation modification, this is for compilers:
    # yume --installroot /var/lib/systemimager/images/i386image install compat-libstdc++-33
    # cexec yume -y install compat-libstdc++-33
    # echo "compat-libstdc++-33" >> /opt/oscar/oscarsamples/fc-8-i386.rpmlist
  77. For some reason pbs_mom might also be running on the head node even though I disabled it at the beginning; remember to check.
  78. For some reason torque-docs was not installed. Fix this with # yume install torque-docs, and make sure you didn't install FC8's torque-docs instead.
  79. I have 79 Pentium 4 nodes and 2 nodes with old Xeons (ia32), so I decided to create a PBS queue for the Pentium 4 nodes so that I can allocate them specifically: use the command "# qmgr < P4.setting", where the P4.setting file contains:
    #
    # Create queues and set their attributes.
    #
    #
    # Create and define queue workq
    #
    create queue p4
    set queue p4 queue_type = Execution
    set queue p4 resources_max.cput = 10000:00:00
    set queue p4 resources_max.ncpus = 4
    set queue p4 resources_max.nodect = 2
    set queue p4 resources_max.walltime = 10000:00:00
    set queue p4 resources_min.cput = 00:00:01
    set queue p4 resources_min.ncpus = 1
    set queue p4 resources_min.nodect = 1
    set queue p4 resources_min.walltime = 00:00:01
    set queue p4 resources_default.cput = 10000:00:00
    set queue p4 resources_default.ncpus = 1
    set queue p4 resources_default.nodect = 1
    set queue p4 resources_default.walltime = 10000:00:00
    set queue p4 resources_available.nodect = 2
    set queue p4 enabled = True
    set queue p4 started = True
    #
    # Assign nodes to queue p4
    #
    set node node1.local,node2.local,node3.local,node4.local,node5.local,node6.local,node7.local,node8.local,node9.local,node10.local,node11.local,node12.local,node13.local,node14.local,node15.local,node16.local,node17.local,node18.local,node19.local,node20.local,node21.local,node22.local,node23.local,node24.local,node25.local,node26.local,node27.local,node28.local,node29.local,node30.local,node31.local,node32.local,node33.local,node34.local,node35.local,node36.local,node37.local,node38.local,node39.local,node40.local,node41.local,node42.local,node43.local,node44.local,node45.local,node46.local,node47.local,node48.local,node49.local,node50.local,node51.local,node52.local,node53.local,node54.local,node55.local,node56.local,node57.local,node58.local,node59.local,node60.local,node61.local,node62.local,node63.local,node64.local,node65.local,node66.local,node67.local,node68.local,node69.local,node70.local,node71.local,node72.local,node73.local,node74.local,node75.local,node76.local,node77.local,node78.local,node79.local properties+=p4
  80. # curl http://mjhsieh.googlecode.com/svn/trunk/OpenPBS/free-nodes -o /usr/local/bin/free-nodes; chmod +x /usr/local/bin/free-nodes
  81. # curl http://mjhsieh.googlecode.com/svn/trunk/OpenPBS/qterm -o /usr/local/bin/qterm; chmod +x /usr/local/bin/qterm

Here is the final result:

.------------------------------------------------------------.
|     * OpenPBS NODES REPORT (v0.052) * (by mjhsieh)         |
`------------------------------------------------------------'
      Queue         Free       CPU       Nodes     Nodes Down 
      Name          Nodes      in use    Defined   or Offline 
      ----------    -----      -----     -----     -----      
      p4            79         0         79        0
      --------------------------------------------------
      Summary:      81         0         81        0
                    There are 0 job(s) queued.

Further commands

  1. Example to update node image: # yume --installroot /var/lib/systemimager/images/i386image update
  2. No, the previous command cannot be replaced by # chroot /var/lib/systemimager/images/i386image plus some other commands.
  3. Example to install/update nodes package: # cexec yume install -y vim-enhanced

Tuesday, May 26, 2009

g03 oops

Duh! I need to add this into /etc/sysctl.conf and run the sysctl command on every cluster node.
# for g03
kernel.randomize_va_space = 0
(Skipped; the commands are the easy part, the hard part is recovering this from my own rusty brain.)
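
The roll-out over the cluster can be sketched with OSCAR's C3 tools; that cpush/cexec are available, and that the head node's sysctl.conf is the master copy, are my assumptions:

```shell
# append the setting once on the head node, then push and apply everywhere
grep -q '^kernel.randomize_va_space' /etc/sysctl.conf \
    || echo 'kernel.randomize_va_space = 0' >> /etc/sysctl.conf
cpush /etc/sysctl.conf /etc/sysctl.conf       # copy to every node (C3)
cexec sysctl -w kernel.randomize_va_space=0   # apply live on every node
```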

About the profile file: beware that if /usr/tmp is not working and gives you errors like "Erroneous write during file extend. write -1 instead of 4096", check the write permission on $GAUSS_SCRDIR. Or just set GAUSS_SCRDIR to $HOME.
export PATH="/home/software/g03/g03:${PATH}"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/home/software/g03/g03:/home/software/g03/gv/lib"
export g03root="/home/software/g03/g03"
export GAUSS_EXEDIR="/home/software/g03/g03"
export GAUSS_SCRDIR="/usr/tmp"
export G03BASIS="/home/software/g03/g03/basis/"
export GAUSS_ARCHDIR="/home/software/g03/g03/arch"
export GV_DIR="/home/software/g03/gv"
alias gv=$GV_DIR'/gview'
alias gview=$GV_DIR'/gview'
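
Given the scratch-directory error above, a small guard in the same profile file can fall back to $HOME automatically; a sketch:

```shell
# fall back to $HOME when the default scratch dir is not writable,
# guarding against "Erroneous write during file extend" failures
if [ ! -w "${GAUSS_SCRDIR:-/usr/tmp}" ]; then
    export GAUSS_SCRDIR="$HOME"
fi
```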

My installation on the cluster:
$ find /home/software/g03 -type d
/home/software/g03
/home/software/g03/g03
/home/software/g03/g03/arch
/home/software/g03/g03/basis
/home/software/g03/g03/bsd
/home/software/g03/g03/tests
/home/software/g03/g03/tests/com
/home/software/g03/g03/tests/i386
/home/software/g03/g03/tests/rs6k
/home/software/g03/gv
/home/software/g03/gv/bin
/home/software/g03/gv/data
/home/software/g03/gv/data/biofrags
/home/software/g03/gv/data/biofrags/fragments
/home/software/g03/gv/data/elements
/home/software/g03/gv/data/elements/bitmaps
/home/software/g03/gv/data/elements/fragments
/home/software/g03/gv/data/fonts
/home/software/g03/gv/data/rgroups
/home/software/g03/gv/data/rgroups/bitmaps
/home/software/g03/gv/data/rgroups/fragments
/home/software/g03/gv/data/rings
/home/software/g03/gv/data/rings/bitmaps
/home/software/g03/gv/data/rings/fragments
/home/software/g03/gv/help
/home/software/g03/gv/help/icons
/home/software/g03/gv/help/pix
/home/software/g03/gv/help/refs
/home/software/g03/gv/lib

Thursday, October 16, 2008

Are you too busy to receive submissions?

A bash function that prints "no" when the queried node is running no jobs at all, and "yes" otherwise.
areyoubusy(){
    njobs=$(pbsnodes -x "$1" | grep '<jobs>' \
            | sed -e 's/.*<jobs>//;s/<\/jobs>.*$//' \
            | awk -F ',' 'END{print NF}')
    if [ "$njobs" -gt 0 ]; then
        echo "yes"
    else
        echo "no"
    fi
}
# usage: areyoubusy node1.local
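
Building on areyoubusy, a hypothetical helper (the function name is mine) that prints the first idle node from a list:

```shell
# print the first node for which areyoubusy answers "no"
firstfreenode(){
    for n in "$@"; do
        if [ "$(areyoubusy "$n")" = "no" ]; then
            echo "$n"
            return 0
        fi
    done
    return 1   # every node queried is busy
}
# usage: firstfreenode node1.local node2.local node3.local
```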

Monday, September 29, 2008

Some testing result on pmemd scaling and parallel computation

Message-ID: <6003f3300809291827p13fcdeew94133ee69f206f1b@mail.gmail.com>
Date: Mon, 29 Sep 2008 18:27:37 -0700
From: "Mengjuei Hsieh"
To: developers
Subject: Some testing result on pmemd scaling and parallel computation

Here is some recap on the JAC benchmark performance on different
parallel options I did this weekend.

We were trying to explore the options for network connection with
jumbo frame (also known as large MTU, mtu=9000 in linux) gigabit
ethernet local network to see if we can replace the previous parallel
computing solution of connecting two machines directly with an
ethernet cable (we called it sub-pairs to reflect the fact that by
doing so, the machines will be grouped in pairs). The reason is
obvious, grouping computing nodes in pairs is not an efficient way to
work with nor to manage the nodes.

We tested with the NetPipe benchmark to measure the performance of a
gigabit ethernet with or without jumbo frame, the benchmark is
consistent with general wisdom and references on the internet or on
the literature. I thought we could utilize more bandwidth with jumbo
frame ethernet.

First, I tested the scaling of amber 9 pmemd with lam/mpi or mpich on
jumbo frame ethernet. The configurations of the testing environment
look like this:

Two identical dell poweredge 1950, each comes with 2 intel xeon 5140
woodcrest duo-core processors, 4MB cache, 2GB RAM. Shared memory
interconnect / MPICH-1.2.6 / LAM-MPI 7.1.4  Intel Fortran 90 compiler,
Intel MKL

The results of the parallel performance are:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

#procs         nsec/day       scaling, %

  1          0.329           --
  2          0.628           95        (SMP)
  4          1.094           83        (SMP)
  4          0.965           73        (TCP, 1+1+1+1)
  4          0.819           62        (SMP/TCP, 2+2)
  8          0.987           37        (SMP/TCP, 4+4)
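
For reference, the scaling column is throughput relative to ideal linear speedup from the single-CPU run; for example, the 2-CPU SMP row:

```shell
# scaling(N) = 100 * perf(N) / (N * perf(1)); check the 2-CPU SMP row
awk 'BEGIN { printf "%.0f%%\n", 100 * 0.628 / (2 * 0.329) }'   # prints 95%
```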

This does not meet any reasonable definition of "scaling", so the network
traffic was also measured: during the network communication, only 30% of
the bandwidth was used. As a side note, these runs used at least
P4_SOCKBUFSIZE=131072 (MPICH) and net.core.rmem_max=131072,
net.core.wmem_max=131072; similar results were observed under LAM/MPI
with rpi_tcp_short=131072.

Further test on direct connection pairs shows that the measurement is similar.

Therefore the benchmark fell back to amber 8 pmemd, which is the
original program we had in the sub-pair configuration.

The results of the parallel performance with amber 8 pmemd are:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

#procs         nsec/day       scaling, %

  1          0.203           --
  2          0.391           96        (SMP)
  4          0.465           57        (SMP)
  4          0.457           56        (SMP/TCP, 2+2)
  8          0.680           42        (SMP/TCP, 4+4)

The less efficient amber 8 pmemd makes the scaling factor of the 4+4-CPU
parallel run look better, but the absolute performance is definitely not
better. Similar results were observed on directly connected pairs.

The interest of this exploration then turned to the scaling of
AMBER 10 pmemd, and the results are:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

#procs         nsec/day       scaling, %

  1          0.411           --
  4          1.329           80        (SMP)
  8          1.137           35        (SMP/TCP, 4+4)

At this point, all I can say is: don't expect anything too interesting
from gigabit ethernet performance. This conclusion is consistent with
observations from Dr. Duke and Dr. Walker.

A further benchmark was done for Amber10 pmemd on a dual quad-core
Intel Xeon E5410 machine (Dell PE1950, 2.3 GHz, 6MB cache, 2G RAM):
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms (on the same machine, SMP mode)

#procs         nsec/day       scaling, %

  1          0.434           --
  2          0.815           94
  4          1.464           84
  6          1.964           75
  8          2.274           65

That's all. AMBER 10 pmemd rocks.

Bests,
--
Mengjuei

Monday, December 31, 2007

Thursday, November 23, 2006

OSCAR5: failed to recognize the SAS interface before mkfs

For some reason my Dell nodes don't like the SIS installer, and it fails to load the mptsas module automatically. mptsas is the SAS driver for the LSI Logic SAS adapter. The BOEL initrd sees the SCSI device but no /dev/sda, so I manually added "modprobe mptsas" to the /var/lib/systemimager/scripts/pre-install file. However, it's also okay to hack the master script instead. Later I will post my way of assigning an internal IP to the extra ethernet port.

Saturday, November 18, 2006

Simple NAT rules (tested on oscar 5)

I think these iptables rules should be sufficient if you want a very simple setup that allows your clients to connect to outside IPs. Put this in /etc/sysconfig/iptables and restart iptables. Please note: my eth0 is for the OSCAR intranet and eth1 is for the internet; change them according to your own setup.
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -o eth1 -j MASQUERADE 
COMMIT
#
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A FORWARD -d 192.168.0.0/255.255.255.0 -i eth0 -j ACCEPT 
-A FORWARD -s 192.168.0.0/255.255.255.0 -i eth0 -j ACCEPT 
COMMIT
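
One detail the rules above rely on: the kernel must also have IPv4 forwarding enabled, or the MASQUERADE rule does nothing. A sketch:

```shell
# enable IPv4 forwarding now, and persist it across reboots
sysctl -w net.ipv4.ip_forward=1
grep -q '^net.ipv4.ip_forward' /etc/sysctl.conf \
    || echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.conf
```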

Monday, July 31, 2006

Backing-up OSCAR 5 Data

According to the start_over script, the OSCAR 5 data could possibly be backed-up with the tar command:

tar cf /backup/oscar.tar /opt/kernel_picker /etc/systemimager /etc/dhcpd.conf /etc/profile.d/00-modules.* /etc/profile.d/c3.* /etc/profile.d/oscar_home.* /etc/profile.d/ssh-oscar.* /etc/systeminstaller /opt/env-switcher* /opt/lam* /opt/maui /opt/modules /opt/mpich* /opt/pbs /opt/perl-Qt /opt/sync_files /tftpboot/initrd.img /tftpboot/kernel /tftpboot/pxelinux.* /usr/lib/systemimager /usr/lib/systeminstaller /var/lib/ganglia /var/lib/mysql /var/lib/oscar /var/lib/systemimager /var/log/systemimager /var/spool/pbs /opt/oscar /etc/httpd/conf.d/ganglia.conf /etc/ssh

This can be useful when you want to change your head node or simply back it up. (This entry is an untested, imaginary solution for backing up the essential parts of the head node/cluster server.)
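
Restoring would then, in principle, be the reverse; a hypothetical sketch, assuming GNU tar stripped the leading / at backup time so extraction is done from the root directory:

```shell
# hypothetical restore on a replacement head node (run as root);
# GNU tar stores the paths without the leading /, so extract from /
cd / && tar xpf /backup/oscar.tar
```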

Backing-up OSCAR 3 Nodes by Hand

Just do this on the computing node:
rsync -avxHS --numeric-ids --delete --exclude 'lost+found' --exclude '/tmp/*' --exclude '/var/spool/clientmqueue/*' --exclude '/var/spool/pbs/aux/*' --exclude '/var/spool/pbs/mom_logs/*' --exclude '/var/spool/pbs/mom_priv/jobs/*' --exclude '/var/spool/pbs/spool/*' --exclude '/var/spool/pbs/undelivered/*' --exclude '*.pid' --exclude '*.lock' --exclude 'swapfile0' / 10.0.0.254:/usr/cluster/nodes/oscar3/root/
and also this:
rsync -avxHS --numeric-ids --delete --exclude 'lost+found' --delete-excluded /boot/ 10.0.0.254:/usr/cluster/nodes/oscar3/boot/
Done. (Update: please be careful about the directories that have been excluded... I don't trust the exclude expressions.)

Reimaging OSCAR 3 Nodes by Hand

Hard drive failures and irreparable filesystems are pretty common during cluster operation, and the first thought is always to re-image the node from the image stored on the server. However, in OSCAR (at least in v3 and v4) you need to flush the OpenPBS settings or restart the PBS server, which causes a lot of problems if you have jobs running on the rest of the computing nodes. Here are my notes for reimaging OSCAR 3 nodes by hand:

  1. PXE boot into the FC2 installer until it loads the GUI (stage 2)
  2. alt-F2 enter the shell
  3. To erase the previous partition, do:
    parted -s -- /dev/sda mklabel msdos
    parted -s -- /dev/sda print
  4. Then build the partition table:
    parted -s -- /dev/sda mkpart primary ext2 0 24
    parted -s -- /dev/sda mkpart extended 24 76319
    parted -s -- /dev/sda mkpart logical 24 2072
    parted -s -- /dev/sda mkpart logical ext2 2072 76319
  5. Format the partition:
    mke2fs /dev/sda6
    mkswap /dev/sda5
    mke2fs /dev/sda1
    mkdir -p /a/boot
    mount /dev/sda6 /a/
    mount /dev/sda1 /a/boot/
  6. FC2 boot disk doesn't have rsync, retrieving from the server:
    scp 10.0.0.254:/usr1/cluster/nodes/rayl3/root/usr/bin/rsync /usr/bin/
  7. Copy:
    rsync -avxHS --numeric-ids --exclude 'lost+found' \ 
            10.0.0.254:/usr/cluster/nodes/oscar3/root/ /a/
    rsync -avxHS --numeric-ids --exclude 'lost+found' \ 
            10.0.0.254:/usr/cluster/nodes/oscar3/boot/ /a/boot/
  8. Create a swap file, since Gaussian has extra swap requirements and we didn't enlarge the swap partition in the first place:
    dd if=/dev/zero of=/a/var/vm/swapfile0 bs=1M count=1024
    mkswap /a/var/vm/swapfile0
  9. Modify the files /etc/sysconfig/network, /etc/sysconfig/network-scripts/ifcfg-eth0, /etc/pfilter.cmds, /etc/pfilter.src for ip address
  10. Restoring GRUB
    chroot /a/
    grub
    root (hd0,0)
    setup (hd0)
    quit
    exit
  11. Reboot, Change the BIOS boot sequence if necessary.

Sunday, July 23, 2006

OSCAR 4 with direct links

Direct Links

For computing nodes that have more than one GbE port, it might be a good idea to make a direct connection between two nodes and take advantage of the network speed for parallel computation. What I did was set up IP addresses for the direct-link ports, for instance 192.168.0.1 for odd-numbered nodes and 192.168.0.2 for even-numbered nodes.

PBS/Torque

In order to access the specified resource through OpenPBS/Torque, we need to create a customized queue for the paired nodes, since it's a peer-to-peer direct link. What I did was use qmgr and import these commands:

# I copied the resources settings from the workq of OSCAR 4, 
# it might be different from the default workq of OSCAR 5
create queue subpair01
set queue subpair01 queue_type = Execution
set queue subpair01 resources_max.cput = 10000:00:00
set queue subpair01 resources_max.ncpus = 8
set queue subpair01 resources_max.nodect = 2
set queue subpair01 resources_max.walltime = 10000:00:00
set queue subpair01 resources_min.cput = 00:00:01
set queue subpair01 resources_min.ncpus = 1
set queue subpair01 resources_min.nodect = 1
set queue subpair01 resources_min.walltime = 00:00:01
set queue subpair01 resources_default.cput = 10000:00:00
set queue subpair01 resources_default.ncpus = 1
set queue subpair01 resources_default.nodect = 1
set queue subpair01 resources_default.walltime = 10000:00:00
set queue subpair01 resources_available.nodect = 2
set queue subpair01 enabled = True
set queue subpair01 started = True
set node node1.local,node2.local properties+=subpair01
Actually, you can save the commands into a file and use
qmgr < ./commands
By the way, before going into parallel computation over the direct link, make sure ssh won't choke on host-key verification. Use c3 to get that done (like: cexec :1-2 ssh 192.168.0.1 uptime; cexec :1-2 ssh 192.168.0.2 uptime). (Actually, there are a lot of potential problems with ssh; I believe the host-key issues are mostly taken care of by the OSCAR installation.)

MPICH

Here I use a PBS script to submit my MPICH jobs; this example is for the AMBER jac benchmark. Please read the script to see how I specify MPI_HOST to tell MPICH the routing of message traffic.

#!/bin/sh
#PBS -N "MPICHjob"
#PBS -q subpair01
#PBS -l nodes=2:subpair01:ppn=8
#PBS -S /bin/sh
#PBS -r n
cd /home/demo/MPICH_SUBPAIR
# customized machinefile
cat > machine.subpairN << EOF
192.168.0.1:4
192.168.0.2:4
EOF
# Tell mpich to run through the direct link
export MPI_HOST=`/sbin/ifconfig eth1 | grep "inet addr:" \ 
                | sed -e 's/inet addr://' | awk '{print $1}'`
# Recommended by Dave Case in the Amber mail list
export P4_SOCKBUFSIZE=524288

# Run
source /opt/intel/fc/9.0/bin/ifortvars.sh
/home/software/mpich_net/bin/mpirun -machinefile ./machine.subpairN -np 8 \ 
        /home/software/amber9/exe/pmemd.MPICH_NET -O -i mdin.amber9 -c \ 
        inpcrd.equil -p prmtop -o /tmp/output.txt -x /dev/null -r \ 
        /dev/null
# Data Retrieval
mv /tmp/output.txt output.pmemd9.MPICH_SUBPAIR

LAM/MPI

These are the scripts for LAM/MPI; you can see I still need to specify the routing of traffic. Also, the first node defined by lamboot may not be the same node that PBS sends you to.

#!/bin/sh
#PBS -N "LAMMPIjob"
#PBS -q subpair01
#PBS -l nodes=2:subpair01:ppn=8
#PBS -S /bin/sh
#PBS -r n
cd /home/demo/LAM
# customized machinefile
cat > machine.subpairN << EOF
192.168.0.1 cpu=4
192.168.0.2 cpu=4
EOF

# if we don't specify -ssi boot rsh, lam will use boot tm and
# the IPs provided by pbs that uses the oscar lan.
/opt/lam/bin/lamboot -ssi boot rsh -ssi rsh_agent "ssh" -v machine.subpairN

# Run
source /opt/intel/fc/9.0/bin/ifortvars.sh

/opt/lam/bin/mpirun -ssi rpi sysv -np 8 \ 
        ./sander9.LAM -O -i mdin -c inpcrd.equil -p prmtop \ 
        -o /tmp/output.txt -x /tmp/trajectory.crd -r /tmp/restart.rst
/opt/lam/bin/lamhalt >& /dev/null
# Data Retrieval
# because the master node is n0, not the first node of pbs
ssh 192.168.0.1 mv /tmp/output.txt /tmp/trajectory.crd /tmp/restart.rst /home/demo/LAM

Friday, July 21, 2006

OSCAR/branches/branch-5-0

Right now I've finished the installation of OSCAR 5.0a and put it into production.

One thing worth noticing is that branch 5.0 is still changing. To keep up to date, I could use svn diff -r revision | less to check the changes and upgrade the OSCAR installation to the current stable branch. Practically, it is much easier to just download the nightly build, do a diff, and update the packages accordingly (by hand, of course).

Monday, February 20, 2006

OSCAR 4 lam-mpi and ifort

This is a note on ifort with LAM/MPI under an OSCAR cluster. You can either compile LAM yourself:

% env CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort ./configure \ 
   --prefix=/home/lammpi
Or, use "ifort -assume 2underscores" as your FC for your applications.

By the way, the run-time SSI option sysv is good for general cases, since communication goes through shared memory on the same machine and through TCP between different machines.

Also check this out:

#!/bin/sh
#PBS -N "LAMMPIjob"
#PBS -q workq
#PBS -l nodes=1:ppn=2
#PBS -S /bin/sh

# To check if this is forked by qsub
# Not necessary
if [ -z "$PBS_ENVIRONMENT" -a "$SSH_TTY" ]
then
    # go to some directory
    cd /home/demo/LAM
else
    # go to some directory
    cd $PBS_O_WORKDIR
fi
# using the lazy default tm module (pbs)
/opt/lam-7.0.6/bin/lamboot
/opt/lam-7.0.6/bin/mpirun -ssi rpi sysv -np 2 \
         ./sander.LAM -O -i mdin -c inpcrd.equil -p prmtop \
        -o /tmp/output.txt -x /tmp/trajectory.crd -r /tmp/restart.rst
/opt/lam-7.0.6/bin/lamhalt
# Data Retrieval
mv /tmp/output.txt /tmp/trajectory.crd /tmp/restart.rst .

Monday, November 21, 2005

Epilogue Script in OpenPBS

This script may not correctly clean up all the orphan processes you want removed. I recommend giving LAM/MPI a try instead of MPICH.

To clean up processes left behind after a job exits the nodes, an epilogue script is a convenient choice. Here is an example (though it does not cover every scenario) for Torque in the OSCAR 4.x package:

#!/bin/sh
# Please notice that ALL processes from $USER will be killed (!!!)
echo '--------------------------------------'
echo Running PBS epilogue script

# Set key variables
USER=$2
NODEFILE=/var/spool/pbs/aux/$1
# despite the name PPN, this counts the distinct nodes allocated to the job
PPN=`/bin/sort $NODEFILE | /usr/bin/uniq | /usr/bin/wc -l`
if [ "$PPN" = "1" ]; then
   # only one node used
   echo Done.
   #su $USER -c "skill -v -9 -u $USER"
else
   # more than one node used
   echo Killing processes of user $USER on the batch nodes
   for node in `cat $NODEFILE`
   do
      echo Doing node $node
      su $USER -c "ssh -a -k -n -x $node skill -v -9 -u $USER"
   done
   echo Done.
fi
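For completeness, installing an epilogue under Torque means placing it in mom_priv on every node, owned by root with restrictive permissions. The sketch below is one way to push it out using cexec/cpush from the C3 tools that ship with OSCAR; the paths match the NODEFILE path used in the script, but the file names are illustrative:

```shell
# Push the epilogue to all nodes and set the permissions Torque insists on
# (root-owned, executable only by root); paths and file names are illustrative.
cpush epilogue /var/spool/pbs/mom_priv/epilogue
cexec "chown root:root /var/spool/pbs/mom_priv/epilogue"
cexec "chmod 500 /var/spool/pbs/mom_priv/epilogue"
```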

Monday, September 19, 2005

AMBER 8 PMEMD on Hyper-Threading Machines

(Update: Mr. Yuen pointed out that when using Linpack (HPL) for the benchmark, the nodes with hyper-threading on are much slower. Therefore I have to say, "your mileage may vary". I believe this is due to context-switching overhead: the PMEMD job we tested has a very small memory footprint, while HPL is one giant matrix multiplication.)
Sometimes I find myself doing benchmarks without science, so let me write this one up more scientifically. For a long time we have wondered whether hyper-threading mode ("logical CPU" mode in Dell's terms) is worth turning on in the cluster, so we ran benchmarks to measure how much hyper-threading helps in our configuration. I have tried to be objective in this note; however, a different network topology or hardware configuration may lead to a different conclusion.

The configuration in this benchmark is shown here:

  • OSCAR 4.2 (pre-beta version) with Fedora Core 3 Linux
  • Intel Xeon 2.8G with 1MB cache
  • Direct GbE connection between two machines
  • LAMMPI/MPICH version of PMEMD from AMBER 8 distribution
  • According to the document provided by Dr. Duke, P4_SOCKBUFSIZE is set to 524288 for MPICH (and /etc/sysctl.conf on the nodes has to be changed accordingly).
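For reference, the sysctl change mentioned above amounts to raising the kernel's socket-buffer ceilings to at least the requested P4_SOCKBUFSIZE (524288 = 512 KiB). The exact entries below are a sketch, not copied from the original nodes:

```shell
# /etc/sysctl.conf on the nodes (apply with "sysctl -p");
# these caps must be >= the P4_SOCKBUFSIZE requested in the job environment
net.core.rmem_max = 524288
net.core.wmem_max = 524288
```

and in the MPICH job environment: export P4_SOCKBUFSIZE=524288.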


Figure 1. The JAC benchmark on different node/processor combinations. "Spreading nodes first" means distributing the threads across as many nodes as possible. In this plot we can see that the performance of 8 threads (4 threads on each node) without hyper-threading is actually much worse than that of 4 threads (2 threads on each node). It also suggests that hyper-threading holds up well under stress tests.


Figure 2. This plot shows that if we fill one node first and then populate the other, we see linear scaling (ignoring the 2-thread calculation).

The scaling of the JAC calculation seems fine, probably because the footprints of PMEMD and the simulated system are small. Perhaps a benchmark on a bigger system is needed (the JAC simulation is DHFR, a protein of 159 amino acid residues, in explicit water). I also tried a Jumbo Frames setting (GbE with MTU > 1500); as other reports predicted, it actually slows down the LAM calculation.

Conclusion:

  1. Hyper-threading does help.
  2. Scaling on hyper-threading machines can be linear, depending on how you look at it.

References:

  1. Dr. Bob Duke, Using Intel compilers (ifc8) with PMEMD
  2. Joint Amber/Charmm DHFR benchmark, the information can be found at AMBER benchmark website.
  3. Gelb Research Group at Washington University in St. Louis, "Fjord" - a linux cluster
  4. G5-P4-Xeon-DoubleLinpack

Thursday, September 01, 2005

Preventing Users From Logging In to Computing Nodes Without PBS

Normally an experienced Beowulf cluster administrator would advise people not to log in to the computing nodes directly. However, we users (I am a user myself) tend to connect to the computing nodes and run things on them without going through the scheduler (or resource allocator, we might say). All right, it is just a small job, and you don't want to run it on the server because the server is often busy. The administrator would probably be mad, because he or she can no longer keep the resources fairly accessible to all users. So the following command (for OSCAR clusters, or any other cluster with PBS/Torque installed) is, I guess, what administrators should recommend their users run:

$ qsub -I -N "interactivejob" -S /bin/tcsh -q workq -l nodes=1:ppn=1
This will let users login the computing nodes through the scheduler.

Do we really think the users will follow the rules and give up logging in to the nodes? No, we are not stupid. A civilized but lazy approach is to beg the users in /etc/motd:

Please! Please do not ssh into the node! We beg you!
Of course this doesn't work on hackers. Unfortunately, people usually think of themselves as hackers. So if we put the following local.csh script in /etc/profile.d/ on the nodes, we can stop manual logins through ssh:
if ( ! $?PBS_ENVIRONMENT ) then
   if ( $?SSH_TTY && `whoami` != "root" ) then
      echo; echo please stop login the node thru ssh; echo
      logout
   endif
endif
Or, as Jenna (in #oscar-cluster @ FreeNode) pointed out, use local.sh for bash/sh users:
[ -z "$PBS_ENVIRONMENT" -a "$SSH_TTY" -a `whoami` != "root" ] && logout
This design will not interfere with cexec, MPI, qsub or pbsdsh. However, it doesn't guarantee that users are absolutely unable to ssh into the nodes; if users insist on doing so, the admin should fall back on more civilized communication skills, not a technical fight.

With this problem gone, we now face another one: people just do their stuff on the server because they can't log in to the nodes. And qsub is such a hassle that a genius won't use it. Screw you guys, I am going home. 凸

Wednesday, August 24, 2005

Booting Nodes with Memtest86/Memtest86+ Under OSCAR 4

OSCAR 5 will bundle netbootmgr, which exploits the beauty of PXE boot and manages the boot options from the server. Great feature!

This is how to add memtest86/memtest86+ by directly modifying (or hacking, I should say) the SystemImager image. Of course the better way would be to find the kernel-switcher/systeminstaller script in the OSCAR 4 packages and modify it there, if I ever have the time. My kludge is to add kernel entries in /var/lib/systemimager/images/IDEimage/etc/systemconfig/systemconfig.conf (if the name of your OSCAR image is IDEimage). You also need to copy the memtest86 boot images into the /var/lib/systemimager/images/IDEimage/boot directory.
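The copy step is nothing fancy; assuming the image name and the memtest versions used in the text, it is something like:

```shell
# Drop the memtest boot images into the golden image's /boot, so the PATHs
# referenced in systemconfig.conf below actually exist inside the image
cp memtest86-3.2.bin.gz memtest86+-1.60.bin.gz \
   /var/lib/systemimager/images/IDEimage/boot/
```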

So my systemconfig.conf looks like this; the sections I added are [KERNEL2] and [KERNEL3] (originally highlighted in red):

# systemconfig.conf written by systeminstaller.
CONFIGBOOT = YES
CONFIGRD = YES

[BOOT]  
        ROOTDEV = /dev/hda6
        BOOTDEV = /dev/hda
        DEFAULTBOOT = 2.6.12-1.1372_F

[KERNEL0]
        PATH = /boot/vmlinuz-2.6.12-1.1372_FC3smp
        LABEL = 2.6.12-1.1372_F

[KERNEL1]
        PATH = /boot/vmlinuz-2.6.12-1.1372_FC3
        LABEL = 2.6.12-1.1372_F

[KERNEL2]
        PATH = /boot/memtest86+-1.60.bin.gz
        LABEL = memtest86+

[KERNEL3]
        PATH = /boot/memtest86-3.2.bin.gz
        LABEL = memtest86
You can find memtest86 at http://www.memtest86.com/ and memtest86+ at http://www.memtest.org/. Well, if you are really bored, you can also modify grub's splash.xpm.gz (the background image of the GRUB boot menu). Just make sure it is a 640x480 XPM with 14 indexed colors, and translate any color names (for example, gray25) to hex RGB (#404040). Then you can replace the original /var/lib/systemimager/images/IDEimage/boot/grub/splash.xpm.gz with your masterpiece. For example, like this....
grub splashimage

Tuesday, August 23, 2005

Ethernet Bridging Problems with OSCAR 4.2β

For some reason, I am using a special but not rare configuration for our OSCAR clusters like this:
network diagram
The reasons I designed it that way were quite simple.

  1. I want larger bandwidth for each computing node.
  2. Because I can.
Theoretically it does provide enough bandwidth for NFS abuse, but I learned my lesson over these two months: my server still needs sophisticated tuning to actually benefit from this bandwidth monster. Going back to the ethernet bridge in OSCAR: it was more or less okay when I set up OSCAR 4.0 beta for our previous 100-node cluster, once the systeminstaller problem was solved. However, back then I was not able to connect from a node on one bridge to nodes on another bridge. I was pretty sure it was a firewall-configuration problem, although I was lazy enough to live with it; besides, we are not running MPI calculations over the (comparatively slow) GbE network anyway. Yesterday I finally got a chance to look at the pfilter configuration and hacked it in an ugly way. :-) I am no firewall expert...

Here let me write down the configuration for anyone who might find it interesting. First of all, my Linux server (head node) has 6 GbE network ports, and the first one already connects to the Internet. I just needed a way to bind the other 5 ethernet ports to the same IP and share the bandwidth load. In Fedora Core 2/3 Linux this is very easy: set each member of the ethernet bridge to ONBOOT=yes, IPADDR=0.0.0.0 and BRIDGE=br0 (br0 is a virtual device for the bridge). Meanwhile, the ifcfg-br0 configuration looks like this:

DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
IPADDR=10.0.0.254
NETMASK=255.255.255.0
ONBOOT=yes
DELAY=0
STP=on
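A member port's ifcfg file then needs little more than those settings plus the device name; for example, a hypothetical ifcfg-eth1 (a sketch following the description above, not copied from the original server):

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth1 (repeat for eth2..eth5);
# the port carries no IP of its own, it is just enslaved to the bridge
DEVICE=eth1
ONBOOT=yes
IPADDR=0.0.0.0
BRIDGE=br0
```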
For more information, check out the ethernet bridge FAQ. Now, assuming the readers of this nonsense blog entry have already installed OSCAR in /opt/oscar but have not yet started "./install_cluster br0", you can modify the script /opt/oscar/packages/pfilter/scripts/post_clients like this (if you prefer the patch file form ... ):
--- post_clients.orig   2005-08-23 20:35:18.000000000 -0700
+++ post_clients        2005-08-23 20:34:44.000000000 -0700
@@ -176,7 +176,7 @@
 
 # the server and every compute node trust each other
 
-trusted %oscar_server% %nodes%
+trusted %oscar_server% %nodes% $on_interface
 open    multicast                        # for ganglia
 
 #
Or, if you already ran "./install_cluster br0", just modify /etc/pfilter.conf, add br0 to the line "trusted %oscar_server% %nodes%", and then issue "service pfilter restart". That's it; your computing nodes can now connect through the bridge interface.

2005-08-23 03:55:13