Please leave a comment below if anyone is reading this.
First of all, my server is a Dell PE2950 with 8 GB of memory and 6 network ports. The first time I tried to install OSCAR, I used CentOS 5.3 x86_64 with OSCAR 6.0.3-1. However, OSCAR 6.0.3-1 was not stable enough for me to install things without errors. No luck with Fedora 9 and OSCAR 6.0.3-1 either, thanks to many perl package obstacles. When I first fell back to OSCAR 5.1+, I tried Fedora 9 under the false impression that it was compatible; unfortunately it is not. So here I come, Fedora 8 x86_64! My goal is to install a cluster with ia32 (i386) nodes on an x86_64 server, using all 6 network ports. Here are some notes:
- Since I installed Fedora 8 from the disk+network without any modification (the only option I chose was the developer software group), I only have a 2G swap partition, so I need to add a swap file.
# dd if=/dev/zero of=/swapfile0 bs=1M count=8192
# mkswap /swapfile0; chmod 600 /swapfile0; swapon /swapfile0
# echo "/swapfile0 swap swap defaults 0 0" >> /etc/fstab
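A side note on the fstab line: rerunning that echo adds a duplicate entry each time. Here is a minimal sketch of an append that is safe to repeat. The helper name append_once is made up for this note and is not part of any package.

```shell
# append_once LINE FILE
# Hypothetical helper: append LINE to FILE only if FILE does not
# already contain an identical line.
append_once() {
    line=$1
    file=$2
    # -x matches whole lines, -F treats LINE as a fixed string, not a regex.
    if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

As root it would be used as `append_once "/swapfile0 swap swap defaults 0 0" /etc/fstab`, and `swapon -s` should list /swapfile0 once the swap file is active.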
# chkconfig NetworkManager off
(in fc8 after the text-mode installation, the default setting is off)
# chkconfig network on
(in fc8 after the text-mode installation, the default setting is on)
# chkconfig iptables off
# chkconfig ip6tables off
- Make sure you really turned off iptables and NetworkManager; the GUI tools might be deceptive.
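One way to double-check from the shell instead of the GUI is `chkconfig --list SERVICE`, which prints one line with an N:on or N:off field per runlevel. The function below is only a sketch of how that output could be tested; the name all_runlevels_off is my own.

```shell
# all_runlevels_off: read `chkconfig --list SERVICE` output on stdin.
# Succeeds (exit 0) only if at least one runlevel field was seen and
# none of them is "on". Hypothetical helper, not part of chkconfig.
all_runlevels_off() {
    awk '
        /[0-6]:on/  { on = 1 }
        /[0-6]:off/ { seen = 1 }
        END {
            if (seen && !on) exit 0
            exit 1
        }
    '
}
```

For example: `chkconfig --list iptables | all_runlevels_off && echo "iptables is off"`. Empty input (for instance when the service does not exist) counts as a failure, so a typo in the service name does not pass silently.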
# perl -pi -e 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# yum -y install bridge-utils gnuplot grace
# brctl addbr br0
# brctl addif br0 eth1; brctl addif br0 eth2; brctl addif br0 eth3; brctl addif br0 eth4; brctl addif br0 eth5
- edit /etc/sysconfig/network-scripts/ifcfg-eth[1-5] accordingly with BOOTPROTO=none, ONBOOT=yes, BRIDGE=br0, NM_CONTROLLED=no, IPV6INIT=no, IPV6_AUTOCONF=no
- edit /etc/sysconfig/network-scripts/ifcfg-eth0 to set NM_CONTROLLED=no, IPV6INIT=no, IPV6_AUTOCONF=no
- create /etc/sysconfig/network-scripts/ifcfg-br0 with contents of
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
IPADDR=10.0.0.254
NETMASK=255.255.255.0
ONBOOT=yes
NOZEROCONF=yes
DELAY=0
STP=no
NM_CONTROLLED=no
IPV6INIT=no
IPV6_AUTOCONF=no
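The five ifcfg-eth[1-5] files differ only in the device name, so they can be generated in a loop. The sketch below writes them into a scratch directory so they can be reviewed first; the function name and its arguments are my own invention. It writes complete files, so anything else already in the real ifcfg files (a HWADDR line, for example) would have to be merged in by hand.

```shell
# write_bridge_port_cfg DEV BRIDGE OUTDIR
# Hypothetical helper: write a minimal ifcfg file that attaches DEV to
# BRIDGE, using the settings listed in the notes above.
write_bridge_port_cfg() {
    dev=$1
    bridge=$2
    outdir=$3
    cat > "$outdir/ifcfg-$dev" <<EOF
DEVICE=$dev
BOOTPROTO=none
ONBOOT=yes
BRIDGE=$bridge
NM_CONTROLLED=no
IPV6INIT=no
IPV6_AUTOCONF=no
EOF
}

# Generate the files for eth1..eth5 in a scratch directory for review.
outdir=$(mktemp -d)
for n in 1 2 3 4 5; do
    write_bridge_port_cfg "eth$n" br0 "$outdir"
done
echo "review the files in $outdir, then copy them to /etc/sysconfig/network-scripts/"
```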
- create /etc/modprobe.d/disableipv6 with one line:
install ipv6 /bin/true
- edit /etc/sysconfig/network to set the hostname, and use the "hostname" command to change the running setting, too.
- Add oscar_server nfs_oscar pbs_oscar for 10.0.0.254 into the file /etc/hosts
- Add all the names for this server into the file /etc/mail/local-host-names
- Add
ALL : 10.0.0.0/255.255.255.0,localhost,the-external-ip-for-your-server
sshd : 10.0.0.0/255.255.255.0,.uci.edu
into the file /etc/hosts.allow . (I am from .uci.edu .)
- Add
ALL : ALL EXCEPT LOCAL
into the file /etc/hosts.deny .
- If httpd was installed, add
ServerAdmin yourname@yourmail.box
ServerSignature Off
ServerTokens Prod
into a new file, /etc/httpd/conf.d/lab.conf
# yum remove NetworkManager.i386
(important if you want to do a full update with yum.)
# yum update
(optional, but it should be very helpful.)
# reboot
- I have doubts about OSCAR 6, so I chose OSCAR 5.x from the nightly branch and downloaded oscar-repo-common-rpms-5*nightly-*.tar.gz, oscar-repo-fc-8-x86_64-5*nightly-*.tar.gz and oscar-repo-fc-8-i386-5*nightly-*.tar.gz
# mkdir -p /tftpboot/distro /tftpboot/oscar; tar xzfC oscar-repo-common-rpms-*.tar.gz /tftpboot/oscar/; tar xzfC oscar-repo-fc-8-x86_64-5*nightly-*.tar.gz /tftpboot/oscar/; tar xzfC oscar-repo-fc-8-i386-5*nightly-*.tar.gz /tftpboot/oscar/
# perl -pi -e 's/gpgcheck=1/gpgcheck=0/' /etc/yum.conf
# yum install createrepo /tftpboot/oscar/common-rpms/yume*.rpm
# yume --repo /tftpboot/oscar/common-rpms install oscar-base
# rsync -avx --delete --bwlimit=128 rsync://archive.fedoraproject.org/fedora-archive/fedora/linux/releases/8/Fedora/x86_64/os/Packages/ /tftpboot/distro/fedora-8-x86_64/
# rsync -avx --delete --bwlimit=128 rsync://archive.fedoraproject.org/fedora-archive/fedora/linux/releases/8/Fedora/i386/os/Packages/ /tftpboot/distro/fedora-8-i386/
- create /tftpboot/distro/fedora-8-i386.url with one line:
file:/tftpboot/distro/fedora-8-i386
since this file won't be generated automatically. (I am not sure whether fedora-8-x86_64.url was generated or bundled with the packages; check it anyway.)
# cd /opt/oscar/lib
# ../scripts/repo-update --url http://archive.fedoraproject.org/pub/archive/fedora/linux/updates/8/i386.newkey --repo /tftpboot/distro/fedora-8-i386
# ../scripts/repo-update --rmdup --repo /tftpboot/distro/fedora-8-i386
# yume --prepare --repo /tftpboot/distro/fedora-8-i386
# ../scripts/repo-update --url http://archive.fedoraproject.org/pub/archive/fedora/linux/updates/8/x86_64.newkey --repo /tftpboot/distro/fedora-8-x86_64
# ../scripts/repo-update --rmdup --repo /tftpboot/distro/fedora-8-x86_64
# rm /tftpboot/distro/*/*torque* /tftpboot/distro/*/openmpi*rpm
# yume --prepare --repo /tftpboot/distro/fedora-8-x86_64
- Make sure that /tftpboot/distro/fedora-8-x86_64.url, /tftpboot/distro/fedora-8-i386.url, /tftpboot/oscar/fc-8-i386.url and /tftpboot/oscar/fc-8-x86_64.url exist and contain correct URL information.
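A quick way to do that check from the shell is sketched below. The function name check_url_files is hypothetical. It only tells you whether each file exists and is non-empty and shows its first line; whether the URL inside is the right one still needs a human look.

```shell
# check_url_files FILE...
# Hypothetical helper: report every FILE that is missing or empty, and
# show the first line of every FILE that has content.
# Returns non-zero if any file was missing or empty.
check_url_files() {
    bad=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "MISSING OR EMPTY: $f"
            bad=1
        else
            echo "$f -> $(head -n 1 "$f")"
        fi
    done
    return $bad
}
```

For this setup the call would be `check_url_files /tftpboot/distro/fedora-8-x86_64.url /tftpboot/distro/fedora-8-i386.url /tftpboot/oscar/fc-8-i386.url /tftpboot/oscar/fc-8-x86_64.url`.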
# perl -pi -e 's/^#PermitRootLogin yes/PermitRootLogin yes/' /etc/ssh/sshd_config
# yum install perl-AppConfig
# perl -pi -e 's/\/usr\/sbin\/netbootmgr/\/usr\/bin\/netbootmgr/' /opt/oscar/scripts/oscar_wizard
# cd /opt/oscar; env OSCAR_VERBOSE=3 ./install_cluster br0
- uncheck the loghost from the package list because it is not working in fc8.
- normally you don't need to worry about the configuration step that follows. I didn't change anything, but ganglia seems to be a good place to adjust the default settings.
- install server packages
- Remember that every time you run "./install_cluster", the settings you made in the previous install_cluster run are lost, and you have to redo the settings and the package installation. To avoid that, once the "install server packages" step is done, if for some reason you exited ./install_cluster, you can use oscar_wizard instead. Here is how: 1. use a new shell window, not your old ./install_cluster shell. 2. cd /opt/oscar/scripts; ./oscar_wizard
- revise /opt/oscar/oscarsamples/scsi.disk for SATA/SCSI nodes configuration.
- revise /opt/oscar/oscarsamples/fc-8-i386.rpmlist
- build an image for i386 (ia32) nodes
- Use another shell window to modify the image:
# chroot /var/lib/systemimager/images/i386image chkconfig avahi-daemon off
- Define a first node for test. Remember to change the oscarnode to node since the string 'oscarnode' is too long.
- Click 'setup networking', start collecting the MAC address and assign IPs, click 'Stop collecting MACs', click 'Configure DHCP Server' then click 'Setup Network Boot', wait for 'okay' popup.
- Open a new shell window to fix /tftpboot/kernel and /tftpboot/initrd.img. These two files are mistakenly linked to the x86_64 versions because the image architecture was labelled x86_64, which is not what we want.
# rm /tftpboot/kernel /tftpboot/initrd.img
# cd /tftpboot
# cp -p /usr/share/systemimager/boot/i386/standard/kernel install-kernel-i386
# cp -p /usr/share/systemimager/boot/i386/standard/initrd.img install-initrd-i386.img
# ln -s install-kernel-i386 kernel
# ln -s install-initrd-i386.img initrd.img
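Since a node that PXE boots with the wrong kernel wastes a whole install cycle, it may be worth checking the links before powering anything on. A minimal sketch follows; link_points_to is a name I made up.

```shell
# link_points_to LINK EXPECTED
# Hypothetical helper: succeed only if LINK is a symlink whose stored
# target is exactly EXPECTED.
link_points_to() {
    [ -L "$1" ] && [ "$(readlink "$1")" = "$2" ]
}
```

With the file names used above: `link_points_to /tftpboot/kernel install-kernel-i386 && link_points_to /tftpboot/initrd.img install-initrd-i386.img && echo "boot links are i386"`.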
- Disable the floppy, Hyperthreading, the missing-keyboard warning, etc.
- Make sure the node is set to PXE boot as the first priority in the boot list.
- Boot and install (I presume it went smoothly.)
- After the reboot, make sure the network interface is correct. Log in and ping 10.0.0.254 to see if the network is working. On more than 10% of my nodes, eth0 and eth1 came up in the wrong order and I could not alias them to the correct order, damn old hardware.
- In my case, I fired up netbootmgr and set the nodes with network failures to "install".
- Reboot the failed nodes and check the BIOS settings; make sure the RAM setting is at "default". All my failed nodes had wrong RAM specification settings.
- Another reason a node may fail to install is that the network is too busy; in that case, a reboot will do.
- After the first node is installed successfully, click the 'Complete Cluster Setup' button.
- Don't forget to click the test cluster button
- Now install 80 more nodes.
- post installation modification:
# perl -pi -e 's/tmpwatch -x \/tmp/tmpwatch -m -x \/tmp/;s/10d/100d/' /var/lib/systemimager/images/i386image/etc/cron.daily/tmpwatch
- post installation modification:
# cpush /var/lib/systemimager/images/i386image/etc/cron.daily/tmpwatch /etc/cron.daily/tmpwatch
- post installation modification: append this in the end of file /var/lib/systemimager/images/i386image/etc/sysctl.conf :
# for g03 (added by mengjuei hsieh)
kernel.randomize_va_space = 0
- post installation modification:
# env -u DISPLAY cexec sysctl -w kernel.randomize_va_space=0
- post installation modification: append this in the end of file /var/lib/systemimager/images/i386image/etc/sysctl.conf :
# suggestion from ibm redbook (added by mjhsieh)
vm.overcommit_memory = 1
- post installation modification:
# env -u DISPLAY cexec 'sysctl -w vm.overcommit_memory=1'
- post installation modification:
# cpush /var/lib/systemimager/images/i386image/etc/sysctl.conf /etc/sysctl.conf
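Appending to sysctl.conf by hand works once, but if the image gets edited again it is easy to end up with two conflicting lines for the same key. Below is a rough sketch of a helper that keeps exactly one assignment per key. The name set_conf_key is hypothetical, and it assumes plain "key = value" lines as in a stock sysctl.conf.

```shell
# set_conf_key KEY VALUE FILE
# Hypothetical helper: make sure FILE contains exactly one "KEY = VALUE"
# line. Existing uncommented assignments of KEY are dropped, comments and
# all other lines are kept, and the new assignment is appended at the end.
set_conf_key() {
    key=$1
    value=$2
    file=$3
    [ -f "$file" ] || : > "$file"
    tmp=$(mktemp) || return 1
    # Split each line at "=", strip blanks from the left-hand side, and
    # keep only the lines whose left-hand side is not KEY.
    awk -F= -v k="$key" '
        { name = $1; gsub(/[ \t]/, "", name) }
        name != k { print }
    ' "$file" > "$tmp" &&
        printf '%s = %s\n' "$key" "$value" >> "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

For the image this would be `set_conf_key kernel.randomize_va_space 0 /var/lib/systemimager/images/i386image/etc/sysctl.conf`, followed by the same cpush as above.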
- post installation modification:
# env -u DISPLAY cexec :1-81 cp -pr /home/software/compilers-i386 /opt/compilers
- post installation modification, this is for compilers:
# yume --installroot /var/lib/systemimager/images/i386image install compat-libstdc++-33
# cexec yume -y install compat-libstdc++-33
# echo "compat-libstdc++-33" >> /opt/oscar/oscarsamples/fc-8-i386.rpmlist
- For some reason I found that pbs_mom might also be running on the cluster head node even though I disabled it at the beginning; remember to check.
- For some reason torque-docs was not installed. Fix this with
# yume install torque-docs
and make sure you didn't install FC8's own torque-docs instead.
- I have 79 Pentium 4 nodes and 2 nodes with old Xeons (ia32), so I decided to create a PBS queue for the Pentium 4 nodes, so that I can allocate them specifically. Use the command
# qmgr < P4.setting
where the P4.setting file contains:
#
# Create queues and set their attributes.
#
#
# Create and define queue p4
#
create queue p4
set queue p4 queue_type = Execution
set queue p4 resources_max.cput = 10000:00:00
set queue p4 resources_max.ncpus = 4
set queue p4 resources_max.nodect = 2
set queue p4 resources_max.walltime = 10000:00:00
set queue p4 resources_min.cput = 00:00:01
set queue p4 resources_min.ncpus = 1
set queue p4 resources_min.nodect = 1
set queue p4 resources_min.walltime = 00:00:01
set queue p4 resources_default.cput = 10000:00:00
set queue p4 resources_default.ncpus = 1
set queue p4 resources_default.nodect = 1
set queue p4 resources_default.walltime = 10000:00:00
set queue p4 resources_available.nodect = 2
set queue p4 enabled = True
set queue p4 started = True
#
# Assign nodes to queue p4
#
set node node1.local,node2.local,node3.local,node4.local,node5.local,node6.local,node7.local,node8.local,node9.local,node10.local,node11.local,node12.local,node13.local,node14.local,node15.local,node16.local,node17.local,node18.local,node19.local,node20.local,node21.local,node22.local,node23.local,node24.local,node25.local,node26.local,node27.local,node28.local,node29.local,node30.local,node31.local,node32.local,node33.local,node34.local,node35.local,node36.local,node37.local,node38.local,node39.local,node40.local,node41.local,node42.local,node43.local,node44.local,node45.local,node46.local,node47.local,node48.local,node49.local,node50.local,node51.local,node52.local,node53.local,node54.local,node55.local,node56.local,node57.local,node58.local,node59.local,node60.local,node61.local,node62.local,node63.local,node64.local,node65.local,node66.local,node67.local,node68.local,node69.local,node70.local,node71.local,node72.local,node73.local,node74.local,node75.local,node76.local,node77.local,node78.local,node79.local properties+=p4
# curl http://mjhsieh.googlecode.com/svn/trunk/OpenPBS/free-nodes -o /usr/local/bin/free-nodes; chmod +x /usr/local/bin/free-nodes
# curl http://mjhsieh.googlecode.com/svn/trunk/OpenPBS/qterm -o /usr/local/bin/qterm; chmod +x /usr/local/bin/qterm
Here is the final result:
.------------------------------------------------------------.
| * OpenPBS NODES REPORT (v0.052) * (by mjhsieh)             |
`------------------------------------------------------------'
Queue       Free   CPU     Nodes    Nodes Down
Name        Nodes  in use  Defined  or Offline
----------  -----  -----   -----    -----
p4          79     0       79       0
--------------------------------------------------
Summary:    81     0       81       0
There are 0 job(s) queued.
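About the long node list in P4.setting: typing 79 names by hand is error-prone, so the list could be generated instead. This is only a sketch; node_list is my own name, and it assumes the nodes are numbered consecutively as nodeN.local like mine are.

```shell
# node_list FIRST LAST
# Hypothetical helper: print nodeFIRST.local,...,nodeLAST.local as one
# comma-separated line.
node_list() {
    i=$1
    last=$2
    list=
    while [ "$i" -le "$last" ]; do
        # ${list:+$list,} expands to "list," only when list is non-empty,
        # so there is no leading comma.
        list="${list:+$list,}node$i.local"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}
```

Since qmgr reads its commands from standard input (that is how P4.setting is fed to it), the node assignment could then be written as `echo "set node $(node_list 1 79) properties+=p4" | qmgr`.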
Further commands
- Example to update node image:
# yume --installroot /var/lib/systemimager/images/i386image update
- No, the previous command cannot be replaced by
# chroot /var/lib/systemimager/images/i386image
plus some other commands.
- Example to install/update a package on the nodes:
# cexec yume install -y vim-enhanced
1 comment:
it's very hard to believe that I need yet another installation 3 months later.