Monday, July 31, 2006

Reimaging OSCAR 3 Nodes by Hand

Hard drive failures and irreparable filesystems are fairly common when operating a cluster, and the first thought is always to reimage the node from the image stored on the server. However, in OSCAR (at least in v3 and v4), doing so requires you to flush the OpenPBS settings or restart the PBS server. That causes a lot of problems if many jobs are running on the rest of the compute nodes. Here are my notes on reimaging OSCAR 3 nodes manually:

  1. PXE boot into the FC2 installer and wait until it loads the GUI (stage 2).
  2. Press Alt-F2 to enter the shell.
  3. To erase the previous partition table, do:
    parted -s -- /dev/sda mklabel msdos
    parted -s -- /dev/sda print
  4. Then build the partition table:
    parted -s -- /dev/sda mkpart primary ext2 0 24
    parted -s -- /dev/sda mkpart extended 24 76319
    parted -s -- /dev/sda mkpart logical 24 2072
    parted -s -- /dev/sda mkpart logical ext2 2072 76319
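  5. Format the partitions and mount them (create /a/boot only after mounting /a, otherwise the new filesystem hides it):
    mke2fs /dev/sda6
    mkswap /dev/sda5
    mke2fs /dev/sda1
    mkdir -p /a
    mount /dev/sda6 /a/
    mkdir -p /a/boot
    mount /dev/sda1 /a/boot/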
  6. The FC2 boot disk doesn't have rsync, so retrieve it from the server:
    scp 10.0.0.254:/usr1/cluster/nodes/rayl3/root/usr/bin/rsync /usr/bin/
  7. Copy the image from the server:
    rsync -avxHS --numeric-ids --exclude 'lost+found' \
            10.0.0.254:/usr/cluster/nodes/oscar3/root/ /a/
    rsync -avxHS --numeric-ids --exclude 'lost+found' \
            10.0.0.254:/usr/cluster/nodes/oscar3/boot/ /a/boot/
  8. Create a swap file, since Gaussian needs extra swap and the swap partition wasn't made large enough in the first place:
    dd if=/dev/zero of=/a/var/vm/swapfile0 bs=1M count=1024
    mkswap /a/var/vm/swapfile0
  9. Edit the restored copies (under /a) of /etc/sysconfig/network, /etc/sysconfig/network-scripts/ifcfg-eth0, /etc/pfilter.cmds and /etc/pfilter.src to set the node's IP address.
  10. Restore GRUB:
    chroot /a/
    grub
    root (hd0,0)
    setup (hd0)
    quit
    exit
  11. Reboot. Change the BIOS boot sequence if necessary.
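
Since the parted commands in steps 3 and 4 destroy whatever is on the disk, it can help to print them for review before running anything. The following is only a minimal sketch of that idea: plan_partitions is a hypothetical helper name (not part of OSCAR or parted), and the boundaries passed to it are assumed to be in the same units as the numbers used above.

```shell
#!/bin/sh
# Sketch only: print the partitioning commands from steps 3 and 4 instead of
# running them, so the layout can be checked first.
# plan_partitions is a hypothetical helper, not an OSCAR or parted command.
# Arguments: device, end of /boot, end of swap, end of disk.
plan_partitions() {
    dev=$1
    boot_end=$2
    swap_end=$3
    disk_end=$4
    echo "parted -s -- $dev mklabel msdos"
    echo "parted -s -- $dev mkpart primary ext2 0 $boot_end"
    echo "parted -s -- $dev mkpart extended $boot_end $disk_end"
    echo "parted -s -- $dev mkpart logical $boot_end $swap_end"
    echo "parted -s -- $dev mkpart logical ext2 $swap_end $disk_end"
}
```

Calling plan_partitions /dev/sda 24 2072 76319 prints the same five commands as steps 3 and 4 (minus the print step). Once the output looks right, it could be piped to sh to actually run it.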
