[gmx-users] Scaling problems in 8-cores nodes with GROMACS 4.0x
Mark Abraham
Mark.Abraham at anu.edu.au
Fri Oct 2 04:39:08 CEST 2009
Daniel Adriano Silva M wrote:
> Hi friends,
>
> I want to update the status of this thread with good news. Last time I
> told you that I was experiencing scaling problems with GROMACS 4.0.x
> on a CentOS (el5) cluster with Infinihost III Lx DDR InfiniBand. Now I
> want to tell you that I finally made the problems disappear (I have
> ruled out InfiniBand problems, although I guess scaling could still be
> better on ConnectX). First, as Berk suggested, the relationship
> between mesh and cut-offs was corrected, but this did not do the
> trick; in fact, the MDP parameters that I showed in the previous
> message were intended to test and improve scaling, not to be accurate.
> However, after extensive testing I found that recompiling mvapich2
> and GROMACS 4.0.5 with Intel icc (version 11.1-046) did the trick.
> Now we are obtaining around 40-50% better performance per core (EVEN
> ON ONE CORE!!!), and the scaling problems are gone. We can now scale
> the previously reported system to 32 cores pretty nicely, which
> contrasts with the earlier failures to scale to 12, 11, or even 8
> cores with PME. It is important to note that I also tried gcc
> (4.3.2-7) with optimizations (-O3, -march, and others), but I was not
> able to reach the performance of icc.
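>
> In case it helps others, a rebuild along these lines should do it (a
> sketch only: the install prefixes are illustrative, and I assume the
> stock autoconf builds of mvapich2 and GROMACS 4.0.x):
>
> # build mvapich2 with icc
> ./configure CC=icc CXX=icpc F77=ifort --prefix=/opt/mvapich2-icc
> make && make install
>
> # build GROMACS 4.0.5 against the new MPI wrapper
> ./configure --enable-mpi CC=/opt/mvapich2-icc/bin/mpicc --prefix=/opt/gromacs-4.0.5-icc
> make && make install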
> To compare with the previous numbers, I post below the numbers from
> reruns of the same system with the new compilation on 4, 8, 16 and 32
> cores; please comment. Also note that in some runs the PP/PME node
> split versus the total CPU count is not ideal for scaling, but those
> splits are mdrun's guesses, and I can certainly improve them by
> specifying the number of PME nodes explicitly, as sketched below.
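>
> For example, something like this should let me fix the split by hand
> (a sketch; the value 8 here is only illustrative, not a tuned choice):
>
> srun -n 32 mdrun-mvapich2 -npme 8 -dlb yes -deffnm full01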
>
> BTW: I found some combinations of fourier-spacing/cut-offs that simply
> lead to an IMMEDIATE crash of my system (spacing/cut: 0.13/1.0;
> 0.13/1.1; 0.135/1.1; 0.16/1.2), while other combinations RUN STABLY
> for at least 40 ns (spacing/cut: 0.12/1.0; 0.14/1.0; 0.14/1.1;
> 0.14/1.2; 0.15/1.2; 0.15/1.3); this appears to be related more to the
> fourier-spacing than to the cut-offs. With the failing combinations,
> mdrun complains about "cell length out of the domain decomposition
> cell of their charge group" and/or cannot settle some water(s). Is
> this behavior normal? If it is, what is the cause of these
> fourier-spacing/cut-off failures? Thanks.
A likely hypothesis is that the initial conditions of your system are
such that it will not always equilibrate reliably. You are using
gen_vel = yes, which means velocities are sampled from a suitable
distribution at the start of the run, but a set of such velocities need
not produce an equilibrium ensemble, nor indeed a well-conditioned
integration. Having found a set of conditions that doesn't "explode",
the normal procedure is to let the system run for a while so it
stabilises. A comparative analysis across PME parameters such as you
describe above is better performed by taking such an equilibrated run
as input to grompp (or perhaps just using a .cpt with a new .tpr) and
*not* generating velocities. Since the starting conditions then belong
to a reasonable ensemble, and the perturbation to that ensemble from
such parameter variation should be minor, you ought to see all such
runs finish successfully.
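For example, something along these lines (a minimal sketch; the file
names are illustrative, and I'm assuming the 4.0.x tool options):

# in pmetest.mdp: gen_vel = no, plus the PME parameters under test
grompp -f pmetest.mdp -c equilibrated.gro -t state.cpt -p topol.top -o pmetest.tpr
mdrun -deffnm pmetest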
(I have just today submitted an as-yet unconfirmed Bugzilla report #350
for pme_order = 6 where minor variation over fourier_n[xyz] led to some
irreproducible problems. You used pme_order = 4, which I observed worked
correctly on 20 fourier_n[xyz] combinations. Given the hypothesis
above, it seems unlikely that our observations are related.)
Mark
> srun -n 32 mdrun-mvapich2 -v -dlb yes -deffnm full01
> _________
> 4 cores:
> Average load imbalance: 0.3 %
> Part of the total run time spent waiting due to load imbalance: 0.2 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Number G-Cycles Seconds %
> -----------------------------------------------------------------------
> Domain decomp. 4 501 29.329 11.7 1.3
> Vsite constr. 4 5001 1.957 0.8 0.1
> Comm. coord. 4 5001 11.370 4.5 0.5
> Neighbor search 4 501 117.072 46.8 5.0
> Force 4 5001 1409.451 563.0 60.4
> Wait + Comm. F 4 5001 38.034 15.2 1.6
> PME mesh 4 5001 582.642 232.7 25.0
> Vsite spread 4 10002 6.085 2.4 0.3
> Write traj. 4 14 0.954 0.4 0.0
> Update 4 5001 42.713 17.1 1.8
> Constraints 4 5001 61.328 24.5 2.6
> Comm. energies 4 5001 2.005 0.8 0.1
> Rest 4 30.367 12.1 1.3
> -----------------------------------------------------------------------
> Total 4 2333.306 932.0 100.0
> -----------------------------------------------------------------------
>
> Parallel run - timing based on wallclock.
>
> NODE (s) Real (s) (%)
> Time: 233.000 233.000 100.0
> 3:53
> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
> Performance: 276.468 15.263 9.272 2.588
>
>
> _________
> 8 cores:
> Average load imbalance: 0.6 %
> Part of the total run time spent waiting due to load imbalance: 0.3 %
> Steps where the load balancing was limited by -rdd, -rcon and/or
> -dds: X 0 % Y 0 %
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Number G-Cycles Seconds %
> -----------------------------------------------------------------------
> Domain decomp. 8 501 44.609 17.8 1.8
> Vsite constr. 8 5001 4.122 1.6 0.2
> Comm. coord. 8 5001 32.280 12.9 1.3
> Neighbor search 8 501 127.789 51.0 5.1
> Force 8 5001 1393.396 556.0 55.2
> Wait + Comm. F 8 5001 71.092 28.4 2.8
> PME mesh 8 5001 694.855 277.3 27.5
> Vsite spread 8 10002 7.755 3.1 0.3
> Write traj. 8 14 1.060 0.4 0.0
> Update 8 5001 43.490 17.4 1.7
> Constraints 8 5001 70.224 28.0 2.8
> Comm. energies 8 5001 3.471 1.4 0.1
> Rest 8 31.993 12.8 1.3
> -----------------------------------------------------------------------
> Total 8 2526.134 1008.0 100.0
> -----------------------------------------------------------------------
>
> Parallel run - timing based on wallclock.
>
> NODE (s) Real (s) (%)
> Time: 126.000 126.000 100.0
> 2:06
> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
> Performance: 511.563 28.228 17.146 1.400
> Finished mdrun on node 0 Thu Oct 1 20:50:28 2009
>
>
> _________
> 16 cores:
> Average load imbalance: 0.8 %
> Part of the total run time spent waiting due to load imbalance: 0.6 %
> Steps where the load balancing was limited by -rdd, -rcon and/or
> -dds: X 0 % Y 0 %
> Average PME mesh/force load: 0.660
> Part of the total run time spent waiting due to PP/PME imbalance: 10.3 %
>
> NOTE: 10.3 % performance was lost because the PME nodes
> had less work to do than the PP nodes.
> You might want to decrease the number of PME nodes
> or decrease the cut-off and the grid spacing.
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Number G-Cycles Seconds %
> -----------------------------------------------------------------------
> Domain decomp. 10 501 43.668 17.5 1.5
> Vsite constr. 10 5001 2.534 1.0 0.1
> Send X to PME 10 5001 6.808 2.7 0.2
> Comm. coord. 10 5001 35.786 14.3 1.2
> Neighbor search 10 501 124.724 50.0 4.2
> Force 10 5001 1384.184 554.6 46.8
> Wait + Comm. F 10 5001 74.121 29.7 2.5
> PME mesh 6 5001 584.282 234.1 19.8
> Wait + Comm. X/F 6 5001 523.711 209.9 17.7
> Wait + Recv. PME F 10 5001 4.567 1.8 0.2
> Vsite spread 10 10002 7.531 3.0 0.3
> Write traj. 10 14 1.177 0.5 0.0
> Update 10 5001 42.708 17.1 1.4
> Constraints 10 5001 66.270 26.6 2.2
> Comm. energies 10 5001 20.971 8.4 0.7
> Rest 10 31.757 12.7 1.1
> -----------------------------------------------------------------------
> Total 16 2954.800 1184.0 100.0
> -----------------------------------------------------------------------
>
> Parallel run - timing based on wallclock.
>
> NODE (s) Real (s) (%)
> Time: 74.000 74.000 100.0
> 1:14
> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
> Performance: 871.450 45.941 29.195 0.822
>
> _________
> 32 cores:
> Average load imbalance: 1.4 %
> Part of the total run time spent waiting due to load imbalance: 1.0 %
> Steps where the load balancing was limited by -rdd, -rcon and/or
> -dds: X 0 % Y 0 %
> Average PME mesh/force load: 0.901
> Part of the total run time spent waiting due to PP/PME imbalance: 2.9 %
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Number G-Cycles Seconds %
> -----------------------------------------------------------------------
> Domain decomp. 20 501 67.311 27.2 2.1
> Vsite constr. 20 5001 4.761 1.9 0.1
> Send X to PME 20 5001 6.982 2.8 0.2
> Comm. coord. 20 5001 76.154 30.8 2.3
> Neighbor search 20 501 130.554 52.8 4.0
> Force 20 5001 1381.435 559.2 42.6
> Wait + Comm. F 20 5001 138.266 56.0 4.3
> PME mesh 12 5001 836.239 338.5 25.8
> Wait + Comm. X/F 12 5001 379.167 153.5 11.7
> Wait + Recv. PME F 20 5001 3.267 1.3 0.1
> Vsite spread 20 10002 14.383 5.8 0.4
> Write traj. 20 14 1.580 0.6 0.0
> Update 20 5001 42.473 17.2 1.3
> Constraints 20 5001 76.153 30.8 2.3
> Comm. energies 20 5001 50.078 20.3 1.5
> Rest 20 32.541 13.2 1.0
> -----------------------------------------------------------------------
> Total 32 3241.342 1312.0 100.0
> -----------------------------------------------------------------------
>
> Parallel run - timing based on wallclock.
>
> NODE (s) Real (s) (%)
> Time: 41.000 41.000 100.0
> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
> Performance: 1572.657 82.914 52.693 0.455
>
> _________
> MDP:
>
> integrator = md
> dt = 0.005
> nsteps = 5000
> pbc = xyz
> nstlist = 10
> rlist = 1.1
> ns_type = grid
> coulombtype = pme
> rcoulomb = 1.1
> vdwtype = cut-off
> rvdw = 1.1
> tcoupl = Berendsen
> tc-grps = protein non-protein
> tau-t = 0.1 0.1
> ref-t = 318 318
> Pcoupl = Berendsen
> pcoupltype = isotropic
> tau-p = 1.0
> ref-p = 1.0
> compressibility = 4.5e-5
> fourierspacing = 0.14
> pme_order = 4
> optimize_fft = yes
> ewald_rtol = 1e-5
> gen_vel = yes
> gen_temp = 318
> gen_seed = 173529
> constraints = all-bonds
> constraint_algorithm = lincs
> lincs_order = 4
> nstxout = 400
> nstvout = 4000
> nstfout = 0
> nstlog = 50
> nstenergy = 50
> energygrps = Protein non-protein
> __________________
> END of message
>
> 2009/9/4 Erik Lindahl <lindahl at cbr.su.se>:
>> Hi,
>>
>> On Sep 3, 2009, at 4:52 AM, Daniel Adriano Silva M wrote:
>>
>>> Dear Gromacs users, (all related to GROMACS ver 4.0.x)
>>>
>>> I am facing a very strange problem on recently acquired Supermicro
>>> 8-core XEON nodes (2.5 GHz quad-cores, XEON E5420, 4 GB RAM per node
>>> with all four memory channels active, 20 Gb/s Infinihost III Lx DDR
>>> InfiniBand): I have been testing these nodes with one of our most
>>> familiar protein models (49887 atoms: 2873 for the protein and the
>>> rest for water in a dodecahedron cell), which I know scales almost
>>> linearly up to 32 cores on a 2.4 GHz quad-core/node Opteron cluster.
>> Without going deeper into the rest of the discussion, note that the
>> E5420 isn't a real quad-core, but a multi-chip module with two dual
>> cores connected by Intel's old/slow front-side bus.
>>
>> In particular, this means all communication and memory operations have to
>> share the narrow bus. Since PME involves more memory IO (charge
>> spreading/interpolation) I'm not entirely surprised if the relative PME
>> scaling doesn't match the direct space scaling. I don't think I've *ever*
>> seen perfect scaling on these chips.
>>
>>
>> The point of separate PME nodes is mainly to improve the high-end
>> scaling, since it reduces the number of MPI calls significantly.
>> However, for the same reason it can obviously lead to load imbalance
>> issues with fewer processors. You can always turn it off manually
>> with mdrun's -npme option (sketched below) - the 12-cpu limit is very
>> much heuristic.
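>>
>> For example (the PME node count in the second line is purely
>> illustrative):
>>
>> mdrun -npme 0 -deffnm run   # no separate PME nodes at all
>> mdrun -npme 8 -deffnm run   # or fix the PP/PME split yourself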
>>
>> Finally, it will be virtually impossible to load balance effectively over
>> e.g. 11 CPUs in your cluster. Remember, there are at least three different
>> latency levels (cores on the same chip, cores on different chips in the same
>> node, cores on different nodes), and all processes running on a node share
>> the IB host adapter. Stick to multiples of 8 and try to have even sizes both
>> for your direct space decomposition as well as the reciprocal space grid.
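>>
>> For example, instead of relying on fourierspacing, the grid can be
>> set explicitly in the .mdp (even values, purely illustrative):
>>
>> fourier_nx = 64
>> fourier_ny = 64
>> fourier_nz = 64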
>>
>> Cheers,
>>
>> Erik
>>