Monday, September 19, 2005

AMBER 8 PMEMD on Hyper-Threading Machines

(Update: Mr. Yuen pointed out that when Linpack (HPL) is used for the benchmark, the nodes with hyper-threading on are much slower. Therefore I have to say, "your mileage may vary." I believe this is due to the overhead of context switching: the PMEMD runs we tested have a very small memory footprint, whereas HPL is a giant matrix multiplication.)
Sometimes I find myself doing benchmarks without much science behind them, so let me write this one up more scientifically. For a long time we have wondered whether hyper-threading mode (logical CPU mode, in Dell's terms) is worthwhile to turn on in the cluster, so we ran benchmark tests to show how much hyper-threading helps in our configuration. I have tried to be objective in this note; however, a different network topology or hardware configuration may lead to a different conclusion.

The configuration used in this benchmark is as follows:

  • OSCAR 4.2 (pre-beta version) with Fedora Core 3 Linux
  • Intel Xeon 2.8 GHz with 1 MB cache
  • Direct GbE connection between two machines
  • LAM/MPI and MPICH builds of PMEMD from the AMBER 8 distribution
  • Following the document provided by Dr. Duke, P4_SOCKBUFSIZE is set to 524288 for MPICH (and /etc/sysctl.conf on the nodes has to be changed accordingly; see the sketch after this list).
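
As an illustration, here is a minimal sketch of that buffer setup; the sysctl keys and the idea of raising the kernel's socket-buffer limits to match P4_SOCKBUFSIZE are my reading of "changed accordingly," not a quotation from Dr. Duke's document:

    # /etc/sysctl.conf on each node: raise the maximum socket buffer
    # sizes to at least the MPICH buffer size (values are illustrative)
    net.core.rmem_max = 524288
    net.core.wmem_max = 524288

    # apply the settings without rebooting
    sysctl -p

    # in the environment that launches the MPICH build of PMEMD
    export P4_SOCKBUFSIZE=524288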


Figure 1. The JAC benchmark on different node/processor combinations. "Spreading nodes first" means distributing the threads across as many nodes as possible. The plot shows that the performance of 8 threads (4 threads on each node) without hyper-threading is actually much worse than that of 4 threads (2 threads on each node). It also suggests that hyper-threading holds up well under a stress-test load.


Figure 2. This plot shows that if we fill one node completely before populating the other node, we see linear scaling (ignoring the 2-thread calculation).

The scaling of the JAC calculation seems fine, probably because the memory footprints of PMEMD and the simulated system are small. Perhaps a benchmark on a bigger system is needed (the JAC simulation uses DHFR, a protein of 159 amino acid residues, in explicit water). I also tried a jumbo-frames setting (GbE with MTU > 1500); as other reports suggested it would, it actually slows down the LAM calculation (a sketch of the setting follows).
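
As a rough sketch of what that jumbo-frames setting involves (the interface name eth0 and the MTU value 9000 are assumptions, and both ends of the direct link must use the same MTU):

    # set a jumbo-frame MTU on the assumed interface eth0
    ifconfig eth0 mtu 9000

    # confirm the new MTU
    ifconfig eth0 | grep -i mtu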

Conclusion:

  1. Hyper-threading does help.
  2. Scaling on hyper-threaded machines can be linear, depending on how you look at it.

References:

  1. Dr. Bob Duke, Using Intel compilers (ifc8) with PMEMD
  2. Joint AMBER/CHARMM DHFR benchmark; details are available on the AMBER benchmark website.
  3. Gelb Research Group at Washington University in St. Louis, "Fjord" - a Linux cluster
  4. G5-P4-Xeon-DoubleLinpack
