[gmx-users] Scaling problems in 8-cores nodes with GROMACS 4.0x

Daniel Adriano Silva M dadriano at gmail.com
Fri Oct 2 04:03:30 CEST 2009


Hi friends,

I want to update the status of this thread with good news. Last time I
told you that I was experiencing scaling problems with Gromacs 4.0.x on
a CentOS (el5) cluster with Infiniband Infinihost III Lx DDR. Now I can
tell you that I finally made the problems disappear (so Infiniband
problems can be discarded, although I suspect scaling could be even
better on ConnectX). First, as Berk suggested, the relationship between
the mesh and the cut-offs was corrected, but that alone did not do the
trick; in fact, the MDP parameters I showed in the previous message were
intended to test and improve scaling, not to reach precision.

However, after extensive testing I found that recompiling mvapich2 and
Gromacs 4.0.5 with Intel icc (version 11.1-046) did the trick. We are
now obtaining around 40-50% better performance per core (EVEN ON ONE
CORE!!!), and the scaling problems are gone. We can now scale the
previously reported system to 32 cores quite nicely, which contrasts
with the earlier failure to scale to 12, 11 or even 8 cores with PME.
It is important to note that I also tried gcc (4.3.2-7) with
optimizations (O3, march and others), but I was not able to reach the
performance of icc.
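For reference, the rebuild went roughly as sketched below (from memory,
not a verbatim recipe; the install prefixes and the FFTW flags are just
what we use here, adjust them to your environment):

# mvapich2 with the Intel compilers
./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=$HOME/opt/mvapich2-icc
make && make install

# single-precision FFTW 3 with icc (GROMACS links against it)
./configure CC=icc --enable-float --enable-sse --prefix=$HOME/opt/fftw3-icc
make && make install

# GROMACS 4.0.5 with icc and the mvapich2 compiler wrapper; the
# --program-suffix gives the mdrun-mvapich2 binary used in the srun line below
export CPPFLAGS=-I$HOME/opt/fftw3-icc/include
export LDFLAGS=-L$HOME/opt/fftw3-icc/lib
./configure CC=icc MPICC=$HOME/opt/mvapich2-icc/bin/mpicc --enable-mpi \
            --program-suffix=-mvapich2 --prefix=$HOME/opt/gromacs-4.0.5-icc
make && make install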
To compare with the previous numbers, I post below the new numbers from
reruns of the same system with the new compilation on 4, 8, 16 and 32
cores; please comment. Also note that some of the PP/PME node splits
versus the total core counts are not ideal for scaling, but these are
mdrun's own guesses, and I can certainly improve on them by specifying
the number of PME nodes explicitly.
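For example (illustrative only, not the exact command that produced the
numbers below): on 16 cores one could fix the PP/PME split by hand with
-npme instead of accepting mdrun's guess of 10 PP + 6 PME nodes, e.g.:

srun -n 16 mdrun-mvapich2 -v -dlb yes -npme 4 -deffnm full01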

BTW: I found some fourierspacing/cut-off combinations that lead to an
IMMEDIATE crash of my system (fs/cut: 0.13/1.0; 0.13/1.1; 0.135/1.1;
0.16/1.2), while other combinations RUN STABLY for at least 40 ns
(fs/cut: 0.12/1.0; 0.14/1.0; 0.14/1.1; 0.14/1.2; 0.15/1.2; 0.15/1.3);
this appears to be related more to the fourier spacing than to the
cut-offs. With the failing combinations, mdrun complains about atoms
being out of the domain decomposition cell of their charge group and/or
that it cannot settle some water(s). Is this behavior normal, and if so,
what is the cause of these fourierspacing/cut-off failures? Thanks.
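In case it helps to reproduce, the PME-related part of the 0.14/1.1 case
(one of the stable ones) is just the block below, taken from the full
MDP at the end of this message; the comment is only my understanding of
the usual advice to scale fourierspacing together with rcoulomb, not a
statement about what caused the crashes:

rlist            =  1.1
rcoulomb         =  1.1
rvdw             =  1.1
fourierspacing   =  0.14   ; scale together with rcoulomb so PME accuracy stays roughly constant
pme_order        =  4
ewald_rtol       =  1e-5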

Daniel Silva
_________
srun -n 32  mdrun-mvapich2 -v -dlb yes -deffnm full01
_________
4 cores:
 Average load imbalance: 0.3 %
 Part of the total run time spent waiting due to load imbalance: 0.2 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
 Domain decomp.         4        501       29.329       11.7     1.3
 Vsite constr.          4       5001        1.957        0.8     0.1
 Comm. coord.           4       5001       11.370        4.5     0.5
 Neighbor search        4        501      117.072       46.8     5.0
 Force                  4       5001     1409.451      563.0    60.4
 Wait + Comm. F         4       5001       38.034       15.2     1.6
 PME mesh               4       5001      582.642      232.7    25.0
 Vsite spread           4      10002        6.085        2.4     0.3
 Write traj.            4         14        0.954        0.4     0.0
 Update                 4       5001       42.713       17.1     1.8
 Constraints            4       5001       61.328       24.5     2.6
 Comm. energies         4       5001        2.005        0.8     0.1
 Rest                   4                  30.367       12.1     1.3
-----------------------------------------------------------------------
 Total                  4                2333.306      932.0   100.0
-----------------------------------------------------------------------

        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:    233.000    233.000    100.0
                       3:53
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    276.468     15.263      9.272      2.588


_________
8 cores:
 Average load imbalance: 0.6 %
 Part of the total run time spent waiting due to load imbalance: 0.3 %
 Steps where the load balancing was limited by -rdd, -rcon and/or
-dds: X 0 % Y 0 %


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
 Domain decomp.         8        501       44.609       17.8     1.8
 Vsite constr.          8       5001        4.122        1.6     0.2
 Comm. coord.           8       5001       32.280       12.9     1.3
 Neighbor search        8        501      127.789       51.0     5.1
 Force                  8       5001     1393.396      556.0    55.2
 Wait + Comm. F         8       5001       71.092       28.4     2.8
 PME mesh               8       5001      694.855      277.3    27.5
 Vsite spread           8      10002        7.755        3.1     0.3
 Write traj.            8         14        1.060        0.4     0.0
 Update                 8       5001       43.490       17.4     1.7
 Constraints            8       5001       70.224       28.0     2.8
 Comm. energies         8       5001        3.471        1.4     0.1
 Rest                   8                  31.993       12.8     1.3
-----------------------------------------------------------------------
 Total                  8                2526.134     1008.0   100.0
-----------------------------------------------------------------------

        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:    126.000    126.000    100.0
                       2:06
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    511.563     28.228     17.146      1.400
Finished mdrun on node 0 Thu Oct  1 20:50:28 2009


_________
16 cores:
 Average load imbalance: 0.8 %
 Part of the total run time spent waiting due to load imbalance: 0.6 %
 Steps where the load balancing was limited by -rdd, -rcon and/or
-dds: X 0 % Y 0 %
 Average PME mesh/force load: 0.660
 Part of the total run time spent waiting due to PP/PME imbalance: 10.3 %

NOTE: 10.3 % performance was lost because the PME nodes
      had less work to do than the PP nodes.
      You might want to decrease the number of PME nodes
      or decrease the cut-off and the grid spacing.


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
 Domain decomp.        10        501       43.668       17.5     1.5
 Vsite constr.         10       5001        2.534        1.0     0.1
 Send X to PME         10       5001        6.808        2.7     0.2
 Comm. coord.          10       5001       35.786       14.3     1.2
 Neighbor search       10        501      124.724       50.0     4.2
 Force                 10       5001     1384.184      554.6    46.8
 Wait + Comm. F        10       5001       74.121       29.7     2.5
 PME mesh               6       5001      584.282      234.1    19.8
 Wait + Comm. X/F       6       5001      523.711      209.9    17.7
 Wait + Recv. PME F    10       5001        4.567        1.8     0.2
 Vsite spread          10      10002        7.531        3.0     0.3
 Write traj.           10         14        1.177        0.5     0.0
 Update                10       5001       42.708       17.1     1.4
 Constraints           10       5001       66.270       26.6     2.2
 Comm. energies        10       5001       20.971        8.4     0.7
 Rest                  10                  31.757       12.7     1.1
-----------------------------------------------------------------------
 Total                 16                2954.800     1184.0   100.0
-----------------------------------------------------------------------

        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:     74.000     74.000    100.0
                       1:14
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    871.450     45.941     29.195      0.822

_________
32 cores:
 Average load imbalance: 1.4 %
 Part of the total run time spent waiting due to load imbalance: 1.0 %
 Steps where the load balancing was limited by -rdd, -rcon and/or
-dds: X 0 % Y 0 %
 Average PME mesh/force load: 0.901
 Part of the total run time spent waiting due to PP/PME imbalance: 2.9 %


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
 Domain decomp.        20        501       67.311       27.2     2.1
 Vsite constr.         20       5001        4.761        1.9     0.1
 Send X to PME         20       5001        6.982        2.8     0.2
 Comm. coord.          20       5001       76.154       30.8     2.3
 Neighbor search       20        501      130.554       52.8     4.0
 Force                 20       5001     1381.435      559.2    42.6
 Wait + Comm. F        20       5001      138.266       56.0     4.3
 PME mesh              12       5001      836.239      338.5    25.8
 Wait + Comm. X/F      12       5001      379.167      153.5    11.7
 Wait + Recv. PME F    20       5001        3.267        1.3     0.1
 Vsite spread          20      10002       14.383        5.8     0.4
 Write traj.           20         14        1.580        0.6     0.0
 Update                20       5001       42.473       17.2     1.3
 Constraints           20       5001       76.153       30.8     2.3
 Comm. energies        20       5001       50.078       20.3     1.5
 Rest                  20                  32.541       13.2     1.0
-----------------------------------------------------------------------
 Total                 32                3241.342     1312.0   100.0
-----------------------------------------------------------------------

	Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:     41.000     41.000    100.0
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:   1572.657     82.914     52.693      0.455

_________
MDP:

integrator      = md			
dt              = 0.005			
nsteps          = 5000		
pbc             = xyz			
nstlist         = 10			
rlist           = 1.1			
ns_type         = grid			
coulombtype     = pme			
rcoulomb        = 1.1			
vdwtype         = cut-off		
rvdw            = 1.1			
tcoupl          = Berendsen		
tc-grps         = protein non-protein
tau-t           = 0.1 0.1		
ref-t           = 318 318 		
Pcoupl          = Berendsen			
pcoupltype      = isotropic		
tau-p           = 1.0			
ref-p           = 1.0 			
compressibility = 4.5e-5		
fourierspacing       =  0.14		
pme_order            =  4		
optimize_fft         =  yes		
ewald_rtol           =  1e-5		
gen_vel              =  yes		
gen_temp             =  318		
gen_seed             =  173529		
constraints          =  all-bonds	
constraint_algorithm =  lincs		
lincs_order          =  4		
nstxout             =  400		
nstvout             =  4000		
nstfout             =  0		
nstlog              =  50		
nstenergy           =  50		
energygrps          =  Protein non-protein	
__________________
END of message

2009/9/4 Erik Lindahl <lindahl at cbr.su.se>:
> Hi,
>
> On Sep 3, 2009, at 4:52 AM, Daniel Adriano Silva M wrote:
>
>> Dear Gromacs users, (all related to GROMACS ver 4.0.x)
>>
>> I am facing a very strange problem on a recently acquired supermicro 8
>> XEON-cores nodes (2.5GHz quad-core/node, 4G/RAM with the four memory
>> channels activated, XEON E5420, 20Gbs Infiniband Infinihost III Lx
>> DDR): I had been testing these nodes with one of our most familiar
>> protein model (49887 atoms: 2873 for protein and the rest for water
>> into a dodecahedron cell) which I known scales almost linearly until
>> 32 cores in a quad-core/node Opteron 2.4 GHz cluster.
>
> Without going deeper into the rest of the discussion, note that the
> E5420 isn't a real quad-core, but a multi-chip module with two dual-core
> dies connected by Intel's old/slow front-side bus.
>
> In particular, this means all communication and memory operations have to
> share the narrow bus. Since PME involves more memory IO (charge
> spreading/interpolation) I'm not entirely surprised if the relative PME
> scaling doesn't match the direct space scaling. I don't think I've *ever*
> seen perfect scaling on these chips.
>
>
> The point of separate PME nodes is mainly to improve the high end scaling,
> since it reduces the number of MPI calls significantly. However, for the
> same reason it can obviously lead to load imbalance issues with fewer
> processors. You can always turn it off manually - the 12-cpu limit is very
> much heuristic.
>
> Finally, it will be virtually impossible to load balance effectively over
> e.g. 11 CPUs in your cluster. Remember, there are at least three different
> latency levels (cores on the same chip, cores on different chips in the same
> node, cores on different nodes), and all processes running on a node share
> the IB host adapter. Stick to multiples of 8 and try to have even sizes both
> for your direct space decomposition as well as the reciprocal space grid.
>
> Cheers,
>
> Erik
>


