[gmx-users] MPI scaling (was RE: MPI tips)

Tue Jan 31 23:00:47 CET 2006

Presumably I'm doing something wrong here but so far the
gromacs MPI performance has been abysmal. Gromacs 3.3, lam-mpi
7.1.1, 100baseT switched network, 20 compute nodes (max).  Gromacs
was built shared and the relevant .so libraries are all in
/usr/common/lib, which is visible at that same path on all nodes.
Additionally that path is in /etc/ld.so.conf and ldconfig was
run on all nodes after gromacs was set up with "make install".

It was suggested that the gmxdemo example was too small so today
I tried changing the original -d .5 value used with editconf to
-d 2, -d 4, and finally -d 8.  Details are:

-d   vol (nm^3)
.5   17     
2   175
4   884
8  5453

This uses lam-mpi and grompp -shuffle (but not -sort).  The
size of the -d 8 box was 17.4 x 18.1 x 17.2 nm.  In each case the same
peptide is embedded in a larger and larger box of water.   Results
were checked visually and the trajectory of each run showed the
peptide bopping around in more or less the same way.

As I noted before, the CPU usage on the compute nodes was
very poor except for N=1.  Things did improve a bit with
increasing volume but in no case did running 20 nodes do
better than about 2x as fast as running one node.  In the "demo"
example mdrun is run 3 times and these are 1st, 2nd, 3rd
in the following tables:

d=.5   1 node  CPU  |  4 node  CPU   | 20 node  CPU 
1st    2.7s    >85% |   8.4s   ~10%  |  14.6s   <10%
2nd    5.4s    >98% |  28.1s  8-12%  |  44.4s   <10%
3rd   46.4s    >98% | 248.8s  7-10%  | 452.2s   <10%

d=2    1 node  CPU  |  4 node  CPU   | 20 node  CPU 
1st    23.8s   >98% |  13.6s  50-60% |  27.0s   <20%
2nd    47.2s   >98% |  39.1s  40-50% |  69.1s   <20%
3rd   452.3s   >98% | 352.1s  40-50% | 617.8s   <20%

d=4    1 node  CPU  |  4 node  CPU   | 20 node  CPU 
1st    112.5s  >98% |   63.5s 50-60% |   56.2s  <35%
2nd    226.7s  >98% |  192.8s 40-50% |  165.8s  <30%
3rd   2229.5s  >98% | 1531.6s 30-50% | 1369.9s  <25%

d=8    1 node  CPU  |  4 node  CPU   | 20 node  CPU 
1st    697.6s  >98% |  -       -     |  334.4s  <35%
2nd   1427.1s  >98% |  -       -     |  944.9s  <35%
3rd   -        -    |  -       -     |  -       -

The d=8 box is a bit bigger than the experiment I actually 
want to run and the scaling still stinks.  20 processors
run only about 2x faster than one.  Moreover gstat showed more
clearly what top was showing - there was a distribution of node
CPU usages which varied from 0 to about 30%, occasionally spiking
up to 35% or higher (this is for d=8, 20 nodes).  That is, the
compute nodes were sitting around twiddling their proverbial thumbs
most of the time.  The CPU usage values would shift around over
time, that is, node A might use little CPU for a long time,
then use a lot, then go back to a little.
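
For anyone who wants to compare their own numbers, the timings in the
tables above can be turned into speedup (T1/Tn) and parallel efficiency
(T1/(n*Tn)).  The following is only a rough sketch: the times are copied
by hand from this post, and the names TIMINGS, speedup, efficiency and
report are made up for illustration, not part of gromacs or lam-mpi.

```python
# Run times in seconds, copied by hand from the tables in this post.
# Layout: editconf -d value -> {node count: [1st, 2nd, 3rd mdrun]}.
# None marks a run that was not done.
TIMINGS = {
    0.5: {1: [2.7, 5.4, 46.4], 4: [8.4, 28.1, 248.8], 20: [14.6, 44.4, 452.2]},
    2: {1: [23.8, 47.2, 452.3], 4: [13.6, 39.1, 352.1], 20: [27.0, 69.1, 617.8]},
    4: {1: [112.5, 226.7, 2229.5], 4: [63.5, 192.8, 1531.6], 20: [56.2, 165.8, 1369.9]},
    8: {1: [697.6, 1427.1, None], 20: [334.4, 944.9, None]},
}


def speedup(t1, tn):
    """Speedup of the n-node run relative to the 1-node run."""
    return t1 / tn


def efficiency(t1, tn, n):
    """Parallel efficiency: speedup divided by the number of nodes."""
    return t1 / (n * tn)


def report(timings):
    """Return (d, nodes, run, speedup, efficiency) for every multi-node run."""
    rows = []
    for d, by_nodes in timings.items():
        base = by_nodes[1]  # 1-node times are the reference
        for n, times in by_nodes.items():
            if n == 1:
                continue
            for run, (t1, tn) in enumerate(zip(base, times), start=1):
                if t1 is None or tn is None:
                    continue  # skip runs that were not done
                rows.append((d, n, run, speedup(t1, tn), efficiency(t1, tn, n)))
    return rows


if __name__ == "__main__":
    for d, n, run, s, e in report(TIMINGS):
        print(f"d={d:<4} nodes={n:<3} run={run}  speedup={s:5.2f}  efficiency={e:6.1%}")
```

With these numbers the best case is the first d=8 run on 20 nodes, at
about 2.1x speedup, which is roughly 10% parallel efficiency.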

Are others seeing better CPU utilization with the MPI
version of mdrun?  (Run something for a couple of minutes, do gstat,
and look at the 1 minute load column.)

In none of the runs presented above was there any significant
CPU activity on the master node, where the jobs were started, when
mdrun was active on the compute nodes.

I looked at the (3.2) manual and it didn't suggest anything to
me that would improve matters.

I also just tried adding -sort to grompp but it didn't change
the d=8, n=20 values shown above significantly.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech