Monday, September 29, 2008

Some testing results on pmemd scaling and parallel computation

Message-ID: <6003f3300809291827p13fcdeew94133ee69f206f1b@mail.gmail.com>
Date: Mon, 29 Sep 2008 18:27:37 -0700
From: "Mengjuei Hsieh"
To: developers
Subject: Some testing results on pmemd scaling and parallel computation

Here is a recap of the JAC benchmark performance with the different
parallel options I tested this weekend.

We were exploring whether a gigabit ethernet local network with jumbo
frames (also known as large MTU; mtu=9000 in Linux) could replace our
previous parallel computing setup of connecting two machines directly
with an ethernet cable (we called these sub-pairs to reflect the fact
that the machines were grouped in pairs). The reason is obvious:
grouping compute nodes in pairs is not an efficient way to work with
or to manage the nodes.
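
For reference, a minimal Python sketch (assuming a Linux node; "eth0" is
only a placeholder interface name) to confirm that an interface really
came up with the larger MTU:

# Sketch: read an interface's current MTU from Linux sysfs.
def interface_mtu(iface="eth0"):
    with open("/sys/class/net/%s/mtu" % iface) as f:
        return int(f.read().strip())

mtu = interface_mtu("eth0")
print("eth0 MTU = %d (%s)" % (mtu, "jumbo" if mtu >= 9000 else "standard"))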

We ran the NetPIPE benchmark to measure the performance of gigabit
ethernet with and without jumbo frames; the results are consistent
with general wisdom and with references on the internet and in the
literature. I expected that we could utilize more bandwidth with
jumbo-frame ethernet.
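
To put a number on that, a back-of-envelope calculation (assuming plain
IPv4/TCP headers and standard Ethernet framing overhead) of how much of
the raw line rate is available to TCP payload at each MTU:

# Back-of-envelope: fraction of the raw line rate left for TCP payload.
# Per frame: 38 bytes of on-wire overhead outside the MTU (preamble+SFD 8,
# Ethernet header 14, FCS 4, inter-frame gap 12) and 40 bytes of IPv4+TCP
# headers inside the MTU (no TCP options assumed).
def tcp_goodput_fraction(mtu):
    return (mtu - 40.0) / (mtu + 38.0)

for mtu in (1500, 9000):
    print("MTU %5d: %.1f%% of line rate" % (mtu, 100 * tcp_goodput_fraction(mtu)))
# Roughly 94.9% vs 99.1%; in practice the bigger win from jumbo frames is
# usually the reduced per-packet CPU/interrupt load rather than wire efficiency.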

First, I tested the scaling of Amber 9 pmemd with LAM/MPI or MPICH over
jumbo-frame ethernet. The testing environment was configured as
follows:

Two identical Dell PowerEdge 1950 servers, each with two Intel Xeon
5140 Woodcrest dual-core processors, 4MB cache, and 2GB RAM. Shared
memory interconnect / MPICH-1.2.6 / LAM-MPI 7.1.4 / Intel Fortran 90
compiler, Intel MKL.

The parallel performance results are:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

#procs         nsec/day       scaling, %

  1          0.329           --
  2          0.628           95        (SMP)
  4          1.094           83        (SMP)
  4          0.965           73        (TCP, 1+1+1+1)
  4          0.819           62        (SMP/TCP, 2+2)
  8          0.987           37        (SMP/TCP, 4+4)

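The "scaling" column here (and in the tables below) works out to the
nsec/day on N processors divided by N times the single-processor
nsec/day. A small Python sketch that reproduces it from the numbers
above:

# Scaling = throughput on N procs / (N * single-proc throughput), in percent.
single = 0.329  # nsec/day on 1 proc (Amber 9 pmemd, JAC NVE)
runs = [        # (procs, nsec/day, label), copied from the table above
    (2, 0.628, "SMP"),
    (4, 1.094, "SMP"),
    (4, 0.965, "TCP, 1+1+1+1"),
    (4, 0.819, "SMP/TCP, 2+2"),
    (8, 0.987, "SMP/TCP, 4+4"),
]
for procs, rate, label in runs:
    pct = 100.0 * rate / (procs * single)
    print("%d procs  %.3f nsec/day  %.1f%%  (%s)" % (procs, rate, pct, label))
# Gives 95.4, 83.1, 73.3, 62.2 and 37.5 percent, in line with the table.
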
This does not meet the definition of "scaling", so I also measured the
network traffic and found that during the network communication only
30% of the available bandwidth was being used. As a side note, these
runs used at least P4_SOCKBUFSIZE=131072 (MPICH) and
net.core.rmem_max=131072, net.core.wmem_max=131072; similar results
were observed under LAM-MPI with rpi_tcp_short=131072.
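
For completeness, a small sketch (assuming Linux /proc paths) of how to
check those kernel socket-buffer limits on a node; P4_SOCKBUFSIZE is the
corresponding MPICH-side buffer size, taken from the environment:

# Check the kernel socket-buffer limits used in these runs (Linux only).
import os

TARGET = 131072  # bytes

for knob in ("rmem_max", "wmem_max"):
    with open("/proc/sys/net/core/%s" % knob) as f:
        value = int(f.read())
    status = "ok" if value >= TARGET else "too small"
    print("net.core.%s = %d (%s)" % (knob, value, status))

print("P4_SOCKBUFSIZE =", os.environ.get("P4_SOCKBUFSIZE", "(not set)"))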

Further tests on directly connected pairs show similar measurements.

I therefore fell back to benchmarking Amber 8 pmemd, which is the
program we originally ran in the sub-pair configuration.

The parallel performance results with Amber 8 pmemd are:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

#procs         nsec/day       scaling, %

  1          0.203           --
  2          0.391           96        (SMP)
  4          0.465           57        (SMP)
  4          0.457           56        (SMP/TCP, 2+2)
  8          0.680           42        (SMP/TCP, 4+4)

The less efficient Amber 8 pmemd makes the scaling factor of the
4+4-CPU parallel computation look better, but the absolute performance
is definitely not better. Similar results were observed on directly
connected pairs.

The interest of this exploration then turned to the scaling of AMBER 10
pmemd, with these results:
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

#procs         nsec/day       scaling, %

  1          0.411           --
  4          1.329           80        (SMP)
  8          1.137           35        (SMP/TCP, 4+4)

At this point, all I can say is: don't expect anything too interesting
from gigabit ethernet performance. This conclusion is consistent with
the observations of Dr. Duke and Dr. Walker.

A further benchmark was run with Amber 10 pmemd on a dual quad-core
Intel Xeon E5410 machine (Dell PE1950, 2.3 GHz, 6MB cache, 2GB RAM):
*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms (on the same machine, SMP mode)

#procs         nsec/day       scaling, %

  1          0.434           --
  2          0.815           94
  4          1.464           84
  6          1.964           75
  8          2.274           65

That's all. AMBER 10 pmemd rocks.

Bests,
--
Mengjuei
