Hard drive failures and irreparable filesystems are fairly common when operating a cluster, and the first thought is always to re-image the node from the image stored on the server. However, in OSCAR (at least in v3 and v4) doing so means you need to flush the OpenPBS settings or restart the PBS server. That causes a lot of problems if many jobs are running on the rest of the compute nodes. Here are my notes on reimaging OSCAR 3 nodes manually:
- PXE boot into the FC2 installer and wait until it loads the GUI (stage 2).
- Press Alt-F2 to switch to the shell.
- To erase the previous partition table, do:
parted -s -- /dev/sda mklabel msdos
parted -s -- /dev/sda print
- Then build the partition table:
parted -s -- /dev/sda mkpart primary ext2 0 24
parted -s -- /dev/sda mkpart extended 24 76319
parted -s -- /dev/sda mkpart logical 24 2072
parted -s -- /dev/sda mkpart logical ext2 2072 76319
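As a sanity check on the boundaries above, the partition sizes can be worked out from the start and end offsets. This is only a sketch: it assumes parted reads these numbers as megabytes (the default in older parted releases), and the helper name `part_size_mb` is made up for illustration.

```shell
# Hypothetical helper: size of a partition given its start and end offsets,
# assuming both are expressed in megabytes.
part_size_mb() {
    echo $(( $2 - $1 ))
}

# The three usable partitions from the layout above.
echo "boot (sda1): $(part_size_mb 0 24) MB"
echo "swap (sda5): $(part_size_mb 24 2072) MB"
echo "root (sda6): $(part_size_mb 2072 76319) MB"
```

Under that assumption the swap partition comes out at 2048 MB and the root partition takes the remainder of the disk.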
- Format the partitions and mount them:
mke2fs /dev/sda6
mkswap /dev/sda5
mke2fs /dev/sda1
mkdir -p /a/boot
mount /dev/sda6 /a/
mount /dev/sda1 /a/boot/
- The FC2 boot disk doesn't have rsync, so retrieve it from the server:
scp 10.0.0.254:/usr1/cluster/nodes/rayl3/root/usr/bin/rsync /usr/bin/
- Copy:
rsync -avxHS --numeric-ids --exclude 'lost+found' \
    10.0.0.254:/usr/cluster/nodes/oscar3/root/ /a/
rsync -avxHS --numeric-ids --exclude 'lost+found' \
    10.0.0.254:/usr/cluster/nodes/oscar3/boot/ /a/boot/
- Create a swap file, since Gaussian needs extra swap and we didn't make the swap partition large enough in the first place:
dd if=/dev/zero of=/a/var/vm/swapfile0 bs=1M count=1024
mkswap /a/var/vm/swapfile0
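The swap file is only used once the node activates it at boot. The image copied by rsync may already carry an fstab entry for it; if it doesn't, something like the following sketch could add one. The helper name `add_swap_entry` is hypothetical and the fstab fields are the usual ones for a swap file, not taken from the original image.

```shell
# Hypothetical helper: append a swap entry to an fstab file unless an entry
# for the same path is already present.
add_swap_entry() {
    fstab=$1
    swapfile=$2
    if ! grep -q "^${swapfile}[[:space:]]" "$fstab" 2>/dev/null; then
        printf '%s\tswap\tswap\tdefaults\t0 0\n' "$swapfile" >> "$fstab"
    fi
}

# Example, with the new root still mounted on /a. The path written into fstab
# is the one the node sees after reboot, so it has no /a prefix:
#   add_swap_entry /a/etc/fstab /var/vm/swapfile0
#   chmod 600 /a/var/vm/swapfile0
```

Calling it twice leaves a single entry, so it is safe to rerun on a node that was partly restored.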
- Set the node's IP address in the restored copies (under /a at this point) of /etc/sysconfig/network, /etc/sysconfig/network-scripts/ifcfg-eth0, /etc/pfilter.cmds and /etc/pfilter.src.
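For ifcfg-eth0 the edit can be scripted instead of done by hand. A minimal sketch, assuming the file holds a single `IPADDR=` line; the helper name `set_ifcfg_ip` and the address in the example are placeholders, and the other three files still need to be edited according to their own formats.

```shell
# Hypothetical helper: rewrite the IPADDR= line of an ifcfg-style file.
# The result is written back through cat so the original file keeps its
# ownership and permissions.
set_ifcfg_ip() {
    file=$1
    ip=$2
    tmp="${file}.tmp.$$"
    sed "s/^IPADDR=.*/IPADDR=${ip}/" "$file" > "$tmp" &&
        cat "$tmp" > "$file" &&
        rm -f "$tmp"
}

# Example (placeholder address, new root mounted on /a):
#   set_ifcfg_ip /a/etc/sysconfig/network-scripts/ifcfg-eth0 10.0.0.3
```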
- Restore GRUB (root, setup and quit are typed at the grub prompt; the final exit leaves the chroot):
chroot /a/
grub
root (hd0,0)
setup (hd0)
quit
exit
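The same GRUB commands can also be fed to the grub shell non-interactively, which helps when restoring several nodes. This is a sketch that relies on GRUB legacy's `--batch` option; the helper name `grub_commands` is hypothetical, and the device names are the ones used above.

```shell
# Hypothetical helper: print the GRUB shell commands for a given boot
# partition and target disk, one per line.
grub_commands() {
    printf 'root %s\nsetup %s\nquit\n' "$1" "$2"
}

# Example, with the new root mounted on /a:
#   grub_commands '(hd0,0)' '(hd0)' | chroot /a grub --batch
```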
- Reboot. Change the BIOS boot sequence if necessary.