[gmx-users] Scaling problems on 8-core nodes with GROMACS 4.0.x

Daniel Adriano Silva M dadriano at gmail.com
Thu Sep 3 04:52:47 CEST 2009


Dear GROMACS users (everything below relates to GROMACS 4.0.x),

I am facing a very strange problem on recently acquired Supermicro 8-core
Xeon nodes (quad-core Xeon E5420 at 2.5 GHz, 4 GB RAM with the four memory
channels in use, 20 Gb/s InfiniBand InfiniHost III Lx DDR). I have been
testing these nodes with one of our most familiar protein models (49887
atoms: 2873 protein atoms and the rest water, in a dodecahedron cell),
which I know scales almost linearly up to 32 cores on a 2.4 GHz quad-core
Opteron cluster. On the new nodes, however, I see severe PME/PP load
imbalance (20% and up). At first I thought the problem was related to
InfiniBand latency, but a recent test gave me a big surprise: since my
model scales very well to 8 cores, I spread the same 8 processes across
four machines, and the performance was the same as on a single node, which
suggests that the cause is something other than latency. After several
tests I realized that the problem appears whenever the run is divided into
separate PME and PP nodes, even within a single node! That is:
If for a short job I run (it is exactly the same for a long run):
srun -n8 /home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun
-v -dlb yes -deffnm FULL01/full01

 Average load imbalance: 0.7 %
 Part of the total run time spent waiting due to load imbalance: 0.2 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 %


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
 Domain decomp.         8        101       19.123        7.6     2.5
 Vsite constr.          8       1001        2.189        0.9     0.3
 Comm. coord.           8       1001        5.810        2.3     0.8
 Neighbor search        8        101       51.432       20.4     6.7
 Force                  8       1001      250.938       99.5    32.7
 Wait + Comm. F         8       1001       15.064        6.0     2.0
 PME mesh               8       1001      337.946      133.9    44.1
 Vsite spread           8       2002        2.991        1.2     0.4
 Write traj.            8          2        0.604        0.2     0.1
 Update                 8       1001       17.854        7.1     2.3
 Constraints            8       1001       35.782       14.2     4.7
 Comm. energies         8       1001        1.407        0.6     0.2
 Rest                   8                  25.889       10.3     3.4
-----------------------------------------------------------------------
 Total                  8                 767.030      304.0   100.0
-----------------------------------------------------------------------

	Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:     38.000     38.000    100.0
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    254.161     14.534     11.380      2.109
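
Here mdrun does not assign separate PME nodes, so every process does both
PP and PME work; as far as I understand, the same layout could also be
requested explicitly with -npme 0, e.g.:

srun -n8 /home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun
-v -dlb yes -npme 0 -deffnm FULL01/full01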



With this layout, scaling is almost linear compared with 1 processor. But
if I force separate PME nodes, using exactly the same total number of
processors:
srun -n8 --cpu_bind=rank
/home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun -v
-dlb yes -npme 3 -deffnm FULL01/full01


 Average load imbalance: 0.5 %
 Part of the total run time spent waiting due to load imbalance: 0.2 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
 Average PME mesh/force load: 1.901
 Part of the total run time spent waiting due to PP/PME imbalance: 23.9 %

NOTE: 23.9 % performance was lost because the PME nodes
      had more work to do than the PP nodes.
      You might want to increase the number of PME nodes
      or increase the cut-off and the grid spacing.


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
 Domain decomp.         5        101       14.660        5.9     1.4
 Vsite constr.          5       1001        1.440        0.6     0.1
 Send X to PME          5       1001        4.601        1.9     0.5
 Comm. coord.           5       1001        3.229        1.3     0.3
 Neighbor search        5        101       48.143       19.4     4.8
 Force                  5       1001      252.340      101.8    25.0
 Wait + Comm. F         5       1001        8.845        3.6     0.9
 PME mesh               3       1001      304.447      122.9    30.1
 Wait + Comm. X/F       3       1001       73.389       29.6     7.3
 Wait + Recv. PME F     5       1001      219.552       88.6    21.7
 Vsite spread           5       2002        3.828        1.5     0.4
 Write traj.            5          2        0.555        0.2     0.1
 Update                 5       1001       17.765        7.2     1.8
 Constraints            5       1001       31.203       12.6     3.1
 Comm. energies         5       1001        1.977        0.8     0.2
 Rest                   5                  25.105       10.1     2.5
-----------------------------------------------------------------------
 Total                  8                1011.079      408.0   100.0
-----------------------------------------------------------------------

	Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:     51.000     51.000    100.0
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    189.377     10.354      8.479      2.831


As you can see, I get very bad performance. The same is true if I do not
specify the number of PME nodes and spread the job over 11 processors
(and it gets worse with more processors), which gives me:
srun -n11 --cpu_bind=rank
/home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun -v
-dlb yes -deffnm FULL01/full01

NOTE: 11.9 % performance was lost because the PME nodes
      had more work to do than the PP nodes.
      You might want to increase the number of PME nodes
      or increase the cut-off and the grid spacing.


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
 Domain decomp.         6        101       15.450        6.2     1.6
 Vsite constr.          6       1001        1.486        0.6     0.2
 Send X to PME          6       1001        1.154        0.5     0.1
 Comm. coord.           6       1001        3.832        1.5     0.4
 Neighbor search        6        101       47.950       19.1     5.1
 Force                  6       1001      250.202       99.7    26.7
 Wait + Comm. F         6       1001       10.022        4.0     1.1
 PME mesh               5       1001      314.841      125.5    33.6
 Wait + Comm. X/F       5       1001      111.565       44.5    11.9
 Wait + Recv. PME F     6       1001      102.240       40.8    10.9
 Vsite spread           6       2002        2.317        0.9     0.2
 Write traj.            6          2        0.567        0.2     0.1
 Update                 6       1001       17.849        7.1     1.9
 Constraints            6       1001       31.215       12.4     3.3
 Comm. energies         6       1001        2.274        0.9     0.2
 Rest                   6                  25.283       10.1     2.7
-----------------------------------------------------------------------
 Total                 11                 938.249      374.0   100.0
-----------------------------------------------------------------------

	Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:     34.000     34.000    100.0
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    284.388     15.963     12.719      1.887
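

For completeness, this is roughly how I scan different numbers of PME
nodes by hand (a minimal sketch; the -npme values and the per-run log
names are just examples):

for NPME in 2 3 4; do
    srun -n8 --cpu_bind=rank \
        /home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun \
        -v -dlb yes -npme $NPME -deffnm FULL01/full01 \
        -g FULL01/full01_npme${NPME}.log
    # print the performance line from each log for comparison
    grep "Performance:" FULL01/full01_npme${NPME}.log
done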



I have tried everything that came to mind, from changing -npme, CPU
affinity, and the mdp cut-offs and fourierspacing, to recompiling and
trying different versions of FFTW. Please advise me on any ideas, tests
to try, or tips. My mdp options for these runs were:

integrator      = md			
dt              = 0.005			
nsteps          = 1000		

pbc             = xyz		
nstlist         = 10			
rlist           = 1.0			
ns_type         = grid		

coulombtype     = pme			
rcoulomb        = 1.0	

vdwtype         = cut-off		
rvdw            = 1.0	

tcoupl          = Berendsen		
tc-grps         = protein non-protein	
tau-t           = 0.1 0.1	
ref-t           = 318 318 	

Pcoupl          = Berendsen	
pcoupltype      = isotropic
tau-p           = 1.0			
ref-p           = 1.0 		
compressibility = 4.5e-5

fourierspacing       =  0.16
pme_order            =  4
optimize_fft         =  yes		
ewald_rtol           =  1e-5	

gen_vel              =  yes	
gen_temp             =  318	
gen_seed             =  173529	

constraints          =  all-bonds
constraint_algorithm =  lincs	
lincs_order          =  4	

nstxout             =  5000	
nstvout             =  5000	
nstfout             =  0	
nstlog              =  5000

nstenergy           =  5000
energygrps          =  Protein non-protein
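

Following the note that mdrun prints, one thing I still plan to test is
shifting work from the PME mesh to the PP nodes by making the real-space
cut-offs longer and the Fourier grid coarser by the same factor (which, as
far as I understand, keeps the Ewald accuracy roughly constant since
ewald_rtol is unchanged). For example, scaling by 1.1 (illustrative values
only; the tpr has to be regenerated with grompp):

rlist           = 1.1
rcoulomb        = 1.1
rvdw            = 1.1
fourierspacing  = 0.176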


Thanks.
Daniel Silva


