[gmx-users] Scaling problems on 8-core nodes with GROMACS 4.0.x

Mark Abraham Mark.Abraham at anu.edu.au
Fri Oct 2 04:39:08 CEST 2009


Daniel Adriano Silva M wrote:
> Hi friends,
> 
> I want to update the status of this thread with good news. Last time
> I told you that I was experiencing scaling problems with GROMACS
> 4.0.x on a CentOS (el5) cluster with Infinihost III Lx DDR
> InfiniBand. I can now report that the problems have disappeared
> (InfiniBand itself was ruled out, although I suspect scaling could be
> even better on ConnectX). First, as Berk suggested, the relationship
> between the mesh and the cut-offs was corrected, but that alone did
> not do the trick; in fact the MDP parameters I showed in the previous
> message were intended to test and improve scaling, not to reach full
> precision. However, after extensive testing I found that recompiling
> mvapich2 and GROMACS 4.0.5 with Intel icc (version 11.1-046) did the
> trick. We now obtain around 40-50% better performance per core (EVEN
> ON ONE CORE!!!), and the scaling problems are gone. We can now scale
> the previously reported system to 32 cores quite nicely, which
> contrasts with the earlier failure to scale to 12, 11 or even 8 cores
> with PME. It is important to note that I also tried gcc (4.3.2-7)
> with optimizations (O3, march and others), but I was not able to
> reach the performance of icc.
> To allow comparison with the previous numbers, I post below the
> results of rerunning the same system with the new build on 4, 8, 16
> and 32 cores; please comment. Also note that in some runs the split
> between PP and PME nodes relative to the total CPU count is not ideal
> for scaling, but those splits are mdrun's guesses, and I can now
> certainly improve on them by specifying the number of PME nodes
> explicitly.
> 
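
Just on this point: the switch for setting the PME node count by hand
is mdrun's -npme. A sketch based on your own command line below, where
the value 8 is only an illustration and should be tuned against the
"Average PME mesh/force load" reported in the log:

  srun -n 32  mdrun-mvapich2 -npme 8 -v -dlb yes -deffnm full01

With 8 of 32 cores doing PME there are 24 PP nodes, which is a rounder
split than the 20/12 one mdrun guessed for your 32-core run.
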
> BTW: I found some combinations of fourier-spacing/cut-offs that lead
> to an IMMEDIATE crash of my system (fs/cut: 0.13/1.0; 0.13/1.1;
> 0.135/1.1; 0.16/1.2; 0.16/1.2), while other combinations RUN STABLY
> for at least 40 ns (fs/cut: 0.12/1.0; 0.14/1.0; 0.14/1.1; 0.14/1.2;
> 0.15/1.2; 0.15/1.3); this appears to be related more to the
> fourier-spacing than to the cut-offs. With the failing combinations,
> mdrun complains that a charge group has moved outside its domain
> decomposition cell, and/or that it cannot settle some water(s). Is
> this behaviour normal, and if so, what causes these
> fourier-spacing/cut-off failures? Thanks.

A likely hypothesis is that the initial conditions of your system are 
such that it will not always equilibrate reliably. You are using 
gen_vel = yes, which means velocities are sampled from a suitable 
distribution at the start of the run, but such a set of velocities 
need not produce an equilibrium ensemble, nor indeed a well-conditioned 
integration. Having found a set of conditions that doesn't "explode", 
the normal procedure is to let the system run for a while so that it 
stabilises. A comparative analysis across PME parameter sets such as 
you describe above is better performed by taking such an equilibrated 
run as input to grompp (or perhaps just using a .cpt with a new .tpr) 
and *not* generating velocities. Since the starting conditions then 
belong to a reasonable ensemble, and the perturbation to that ensemble 
from varying these parameters should be minor, you ought to see all 
such runs finish successfully.
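
For concreteness, a minimal sketch only (the file names pme_test.mdp, 
equil.cpt and pme_test.tpr are illustrative, and I'm assuming your 
grompp will take the checkpoint via -t):

  grompp -f pme_test.mdp -c conf.gro -t equil.cpt -p topol.top -o pme_test.tpr
  srun -n 32  mdrun-mvapich2 -v -dlb yes -deffnm pme_test

with gen_vel = no in pme_test.mdp, so that every PME parameter set is 
started from the same equilibrated coordinates and velocities rather 
than from freshly generated ones.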

(I have just today submitted an as-yet unconfirmed Bugzilla report #350 
for pme_order = 6 where minor variation over fourier_n[xyz] led to some 
irreproducible problems. You used pme_order = 4, which I observed worked 
correctly on 20 fourier_n[xyz] combinations. It seems unlikely that our 
observations are related, because of the above issue.)
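
(A further aside on the scan itself: PME accuracy is governed by the 
cut-off, the grid spacing and ewald_rtol together, so it is usually 
cleaner to scale rcoulomb and fourierspacing by the same factor than 
to vary them independently. An illustrative .mdp fragment only, not a 
recommendation for your system:

   ; keep the mesh/cut-off relation near the 1.0 / 0.12 default
   rlist           = 1.2
   rcoulomb        = 1.2
   rvdw            = 1.2
   fourierspacing  = 0.144

That said, as argued above, your crashes look more like an 
equilibration problem than an accuracy problem.)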

Mark

> srun -n 32  mdrun-mvapich2 -v -dlb yes -deffnm full01
> _________
> 4 cores:
>  Average load imbalance: 0.3 %
>  Part of the total run time spent waiting due to load imbalance: 0.2 %
>  Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
> 
> 
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> 
>  Computing:         Nodes     Number     G-Cycles    Seconds     %
> -----------------------------------------------------------------------
>  Domain decomp.         4        501       29.329       11.7     1.3
>  Vsite constr.          4       5001        1.957        0.8     0.1
>  Comm. coord.           4       5001       11.370        4.5     0.5
>  Neighbor search        4        501      117.072       46.8     5.0
>  Force                  4       5001     1409.451      563.0    60.4
>  Wait + Comm. F         4       5001       38.034       15.2     1.6
>  PME mesh               4       5001      582.642      232.7    25.0
>  Vsite spread           4      10002        6.085        2.4     0.3
>  Write traj.            4         14        0.954        0.4     0.0
>  Update                 4       5001       42.713       17.1     1.8
>  Constraints            4       5001       61.328       24.5     2.6
>  Comm. energies         4       5001        2.005        0.8     0.1
>  Rest                   4                  30.367       12.1     1.3
> -----------------------------------------------------------------------
>  Total                  4                2333.306      932.0   100.0
> -----------------------------------------------------------------------
> 
>         Parallel run - timing based on wallclock.
> 
>                NODE (s)   Real (s)      (%)
>        Time:    233.000    233.000    100.0
>                        3:53
>                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> Performance:    276.468     15.263      9.272      2.588
> 
> 
> _________
> 8 cores:
>  Average load imbalance: 0.6 %
>  Part of the total run time spent waiting due to load imbalance: 0.3 %
>  Steps where the load balancing was limited by -rdd, -rcon and/or
> -dds: X 0 % Y 0 %
> 
> 
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> 
>  Computing:         Nodes     Number     G-Cycles    Seconds     %
> -----------------------------------------------------------------------
>  Domain decomp.         8        501       44.609       17.8     1.8
>  Vsite constr.          8       5001        4.122        1.6     0.2
>  Comm. coord.           8       5001       32.280       12.9     1.3
>  Neighbor search        8        501      127.789       51.0     5.1
>  Force                  8       5001     1393.396      556.0    55.2
>  Wait + Comm. F         8       5001       71.092       28.4     2.8
>  PME mesh               8       5001      694.855      277.3    27.5
>  Vsite spread           8      10002        7.755        3.1     0.3
>  Write traj.            8         14        1.060        0.4     0.0
>  Update                 8       5001       43.490       17.4     1.7
>  Constraints            8       5001       70.224       28.0     2.8
>  Comm. energies         8       5001        3.471        1.4     0.1
>  Rest                   8                  31.993       12.8     1.3
> -----------------------------------------------------------------------
>  Total                  8                2526.134     1008.0   100.0
> -----------------------------------------------------------------------
> 
>         Parallel run - timing based on wallclock.
> 
>                NODE (s)   Real (s)      (%)
>        Time:    126.000    126.000    100.0
>                        2:06
>                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> Performance:    511.563     28.228     17.146      1.400
> Finished mdrun on node 0 Thu Oct  1 20:50:28 2009
> 
> 
> _________
> 16 cores:
>  Average load imbalance: 0.8 %
>  Part of the total run time spent waiting due to load imbalance: 0.6 %
>  Steps where the load balancing was limited by -rdd, -rcon and/or
> -dds: X 0 % Y 0 %
>  Average PME mesh/force load: 0.660
>  Part of the total run time spent waiting due to PP/PME imbalance: 10.3 %
> 
> NOTE: 10.3 % performance was lost because the PME nodes
>       had less work to do than the PP nodes.
>       You might want to decrease the number of PME nodes
>       or decrease the cut-off and the grid spacing.
> 
> 
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> 
>  Computing:         Nodes     Number     G-Cycles    Seconds     %
> -----------------------------------------------------------------------
>  Domain decomp.        10        501       43.668       17.5     1.5
>  Vsite constr.         10       5001        2.534        1.0     0.1
>  Send X to PME         10       5001        6.808        2.7     0.2
>  Comm. coord.          10       5001       35.786       14.3     1.2
>  Neighbor search       10        501      124.724       50.0     4.2
>  Force                 10       5001     1384.184      554.6    46.8
>  Wait + Comm. F        10       5001       74.121       29.7     2.5
>  PME mesh               6       5001      584.282      234.1    19.8
>  Wait + Comm. X/F       6       5001      523.711      209.9    17.7
>  Wait + Recv. PME F    10       5001        4.567        1.8     0.2
>  Vsite spread          10      10002        7.531        3.0     0.3
>  Write traj.           10         14        1.177        0.5     0.0
>  Update                10       5001       42.708       17.1     1.4
>  Constraints           10       5001       66.270       26.6     2.2
>  Comm. energies        10       5001       20.971        8.4     0.7
>  Rest                  10                  31.757       12.7     1.1
> -----------------------------------------------------------------------
>  Total                 16                2954.800     1184.0   100.0
> -----------------------------------------------------------------------
> 
>         Parallel run - timing based on wallclock.
> 
>                NODE (s)   Real (s)      (%)
>        Time:     74.000     74.000    100.0
>                        1:14
>                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> Performance:    871.450     45.941     29.195      0.822
> 
> _________
> 32 cores:
>  Average load imbalance: 1.4 %
>  Part of the total run time spent waiting due to load imbalance: 1.0 %
>  Steps where the load balancing was limited by -rdd, -rcon and/or
> -dds: X 0 % Y 0 %
>  Average PME mesh/force load: 0.901
>  Part of the total run time spent waiting due to PP/PME imbalance: 2.9 %
> 
> 
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> 
>  Computing:         Nodes     Number     G-Cycles    Seconds     %
> -----------------------------------------------------------------------
>  Domain decomp.        20        501       67.311       27.2     2.1
>  Vsite constr.         20       5001        4.761        1.9     0.1
>  Send X to PME         20       5001        6.982        2.8     0.2
>  Comm. coord.          20       5001       76.154       30.8     2.3
>  Neighbor search       20        501      130.554       52.8     4.0
>  Force                 20       5001     1381.435      559.2    42.6
>  Wait + Comm. F        20       5001      138.266       56.0     4.3
>  PME mesh              12       5001      836.239      338.5    25.8
>  Wait + Comm. X/F      12       5001      379.167      153.5    11.7
>  Wait + Recv. PME F    20       5001        3.267        1.3     0.1
>  Vsite spread          20      10002       14.383        5.8     0.4
>  Write traj.           20         14        1.580        0.6     0.0
>  Update                20       5001       42.473       17.2     1.3
>  Constraints           20       5001       76.153       30.8     2.3
>  Comm. energies        20       5001       50.078       20.3     1.5
>  Rest                  20                  32.541       13.2     1.0
> -----------------------------------------------------------------------
>  Total                 32                3241.342     1312.0   100.0
> -----------------------------------------------------------------------
> 
> 	Parallel run - timing based on wallclock.
> 
>                NODE (s)   Real (s)      (%)
>        Time:     41.000     41.000    100.0
>                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> Performance:   1572.657     82.914     52.693      0.455
> 
> _________
> MDP:
> 
> integrator      = md			
> dt              = 0.005			
> nsteps          = 5000		
> pbc             = xyz			
> nstlist         = 10			
> rlist           = 1.1			
> ns_type         = grid			
> coulombtype     = pme			
> rcoulomb        = 1.1			
> vdwtype         = cut-off		
> rvdw            = 1.1			
> tcoupl          = Berendsen		
> tc-grps         = protein non-protein
> tau-t           = 0.1 0.1		
> ref-t           = 318 318 		
> Pcoupl          = Berendsen			
> pcoupltype      = isotropic		
> tau-p           = 1.0			
> ref-p           = 1.0 			
> compressibility = 4.5e-5		
> fourierspacing       =  0.14		
> pme_order            =  4		
> optimize_fft         =  yes		
> ewald_rtol           =  1e-5		
> gen_vel              =  yes		
> gen_temp             =  318		
> gen_seed             =  173529		
> constraints          =  all-bonds	
> constraint_algorithm =  lincs		
> lincs_order          =  4		
> nstxout             =  400		
> nstvout             =  4000		
> nstfout             =  0		
> nstlog              =  50		
> nstenergy           =  50		
> energygrps          =  Protein non-protein	
> __________________
> END of message
> 
> 2009/9/4 Erik Lindahl <lindahl at cbr.su.se>:
>> Hi,
>>
>> On Sep 3, 2009, at 4:52 AM, Daniel Adriano Silva M wrote:
>>
>>> Dear Gromacs users, (all related to GROMACS ver 4.0.x)
>>>
>>> I am facing a very strange problem on recently acquired Supermicro
>>> nodes with 8 XEON cores each (2.5 GHz quad-core XEON E5420, 4 GB RAM
>>> with all four memory channels active, 20 Gb/s InfiniBand Infinihost
>>> III Lx DDR). I have been testing these nodes with one of our most
>>> familiar protein models (49887 atoms: 2873 for the protein and the
>>> rest for water in a dodecahedron cell), which I know scales almost
>>> linearly up to 32 cores on a 2.4 GHz quad-core/node Opteron cluster.
>> Without going deeper into the rest of the discussion, note that the
>> E5420 isn't a real quad-core, but a multi-chip module with two dual
>> cores connected by Intel's old/slow front-side bus.
>>
>> In particular, this means all communication and memory operations have to
>> share the narrow bus. Since PME involves more memory IO (charge
>> spreading/interpolation) I'm not entirely surprised if the relative PME
>> scaling doesn't match the direct space scaling. I don't think I've *ever*
>> seen perfect scaling on these chips.
>>
>>
>> The point of separate PME nodes is mainly to improve the high end scaling,
>> since it reduces the number of MPI calls significantly. However, for the
>> same reason it can obviously lead to load imbalance issues with fewer
>> processors. You can always turn it off manually - the 12-cpu limit is very
>> much heuristic.
>>
>> Finally, it will be virtually impossible to load balance effectively over
>> e.g. 11 CPUs in your cluster. Remember, there are at least three different
>> latency levels (cores on the same chip, cores on different chips in the same
>> node, cores on different nodes), and all processes running on a node share
>> the IB host adapter. Stick to multiples of 8 and try to have even sizes both
>> for your direct space decomposition as well as the reciprocal space grid.
>>
>> Cheers,
>>
>> Erik
>>
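
To make Erik's last suggestion concrete, a hedged illustration only 
(the particular numbers are not a prescription for this system): on 
four of these 8-core nodes one could fix the layout by hand with

  srun -n 32  mdrun-mvapich2 -dd 4 3 2 -npme 8 -dlb yes -deffnm full01

i.e. 24 PP nodes in a 4x3x2 decomposition plus 8 PME nodes, both 
multiples of 8. If needed, the reciprocal grid can also be pinned by 
setting fourier_nx/ny/nz in the .mdp so that it divides evenly over 
the PME nodes.
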


