Message-ID: <6003f3300809291827p13fcdeew94133ee69f206f1b@mail.gmail.com>
Date: Mon, 29 Sep 2008 18:27:37 -0700
From: "Mengjuei Hsieh"
To: developers
Subject: Some testing result on pmemd scaling and parallel computation

Here is a recap of the JAC benchmark performance under the different parallel
options I tried this weekend. We were exploring jumbo-frame gigabit ethernet
(also known as large MTU, mtu=9000 in Linux) on the local network, to see
whether it could replace our previous parallel-computing setup of connecting
two machines directly with an ethernet cable (we called these "sub-pairs" to
reflect the fact that the machines end up grouped in pairs). The reason is
obvious: grouping compute nodes in pairs is not an efficient way to work with
or to manage them.

We measured the gigabit link with the NetPipe benchmark, with and without
jumbo frames; the results are consistent with general wisdom and with
references on the internet and in the literature. I thought we could utilize
more of the bandwidth with jumbo-frame ethernet.

First, I tested the scaling of Amber 9 pmemd with LAM/MPI or MPICH over
jumbo-frame ethernet. The test environment looks like this:

  Two identical Dell PowerEdge 1950, each with 2 Intel Xeon 5140 (Woodcrest)
  dual-core processors, 4MB cache, 2GB RAM.
  Shared-memory interconnect / MPICH-1.2.6 / LAM-MPI 7.1.4
  Intel Fortran 90 compiler, Intel MKL

The parallel performance results are:

*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

  #procs   nsec/day   scaling, %
     1      0.329       --
     2      0.628       95  (SMP)
     4      1.094       83  (SMP)
     4      0.965       73  (TCP, 1+1+1+1)
     4      0.819       62  (SMP/TCP, 2+2)
     8      0.987       37  (SMP/TCP, 4+4)

This hardly meets the definition of "scaling", so I also measured the network
traffic and found that, during the network communication, only about 30% of
the bandwidth was used. As a side note, these runs used at least
P4_SOCKBUFSIZE=131072 (MPICH) and net.core.rmem_max=131072,
net.core.wmem_max=131072; similar results were observed under LAM-MPI with
rpi_tcp_short=131072 (a small sketch of applying these settings is included
below). A further test on a directly connected pair gave similar
measurements.

The benchmark therefore fell back to Amber 8 pmemd, the program we originally
ran in the sub-pair configuration. The parallel performance results with
Amber 8 pmemd are:

*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

  #procs   nsec/day   scaling, %
     1      0.203       --
     2      0.391       96  (SMP)
     4      0.465       57  (SMP)
     4      0.457       56  (SMP/TCP, 2+2)
     8      0.680       42  (SMP/TCP, 4+4)

Because Amber 8 pmemd is less efficient to begin with, the scaling factor of
the 4+4-cpu run looks better, but the absolute performance is definitely not.
Similar results were observed on directly connected pairs.

The interest of this exploration then turned to the scaling of AMBER 10
pmemd, with these results:

*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms

  #procs   nsec/day   scaling, %
     1      0.411       --
     4      1.329       80  (SMP)
     8      1.137       35  (SMP/TCP, 4+4)

At this point, all I can say is: don't expect anything too interesting from
gigabit ethernet performance. This conclusion is consistent with the
observations of Dr. Duke and Dr. Walker.
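For reference, the "scaling, %" column in these tables is the per-process
throughput relative to the single-process run, i.e. 100 * (perf_N / perf_1) / N.
Below is a minimal Python sketch (my own helper, not part of pmemd or the
benchmark scripts) that reproduces the quoted values to within a percentage
point, using the Amber 9 numbers from the first table as sample data:

    # Minimal sketch (my own helper, not part of pmemd): parallel efficiency
    # as throughput per process relative to the 1-process run.
    # Sample data are the Amber 9 pmemd numbers from the first table above.

    def scaling_percent(perf_n, perf_1, nprocs):
        """Parallel efficiency in percent, with performance given in nsec/day."""
        return 100.0 * (perf_n / perf_1) / nprocs

    perf_1 = 0.329  # nsec/day with 1 process
    runs = [
        (2, 0.628, "SMP"),
        (4, 1.094, "SMP"),
        (4, 0.965, "TCP, 1+1+1+1"),
        (4, 0.819, "SMP/TCP, 2+2"),
        (8, 0.987, "SMP/TCP, 4+4"),
    ]

    for nprocs, perf_n, layout in runs:
        eff = scaling_percent(perf_n, perf_1, nprocs)
        print(f"{nprocs:2d} procs  {perf_n:5.3f} nsec/day  {eff:5.1f}%  ({layout})")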
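As a footnote to the socket-buffer settings quoted earlier, here is a minimal
Python sketch of applying the same values; it assumes root access on each node
and the standard Linux sysctl command, and is only an illustration of the
parameters named above, not a script we actually ran:

    # Minimal sketch, assuming root access and the standard Linux sysctl
    # command: apply the kernel socket-buffer limits quoted above and export
    # the MPICH-1 (ch_p4) socket buffer size.
    import os
    import subprocess

    for key in ("net.core.rmem_max", "net.core.wmem_max"):
        subprocess.run(["sysctl", "-w", f"{key}=131072"], check=True)

    # P4_SOCKBUFSIZE is read from the environment by MPICH-1's ch_p4 device;
    # setting it here only helps if mpirun is launched from this same script,
    # otherwise export it in the shell that runs mpirun.
    os.environ["P4_SOCKBUFSIZE"] = "131072"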
A further benchmark was run for AMBER 10 pmemd on a dual quad-core Intel Xeon
E5410 machine (Dell PE1950, 2.3GHz, 6MB cache, 2GB RAM):

*******************************************************************************
JAC - NVE ensemble, PME, 23,558 atoms (on the same machine, SMP mode)

  #procs   nsec/day   scaling, %
     1      0.434       --
     2      0.815       94
     4      1.464       84
     6      1.964       75
     8      2.274       65

That's all. AMBER 10 pmemd rocks.

Bests,
--
Mengjuei