[gmx-users] Scaling problems in 8-cores nodes with GROMACS 4.0x
Daniel Adriano Silva M
dadriano at gmail.com
Fri Oct 2 04:03:30 CEST 2009
Hi friends,
I want to update the status of this thread with good news. Last time I
told you that I was experiencing scaling problems with GROMACS 4.0.x
on a CentOS (el5) cluster with InfiniBand (Infinihost III Lx DDR). Now I
can tell you that I finally made the problems disappear (so InfiniBand
itself is ruled out, although I guess scaling could still be better on
ConnectX). First, as Berk suggested, I corrected the relationship between
the mesh spacing and the cut-offs, but that alone did not do the trick;
in fact, the MDP parameters I showed in the previous message were intended
to test and improve scaling, not to reach precision.
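(As a side note for anyone checking the same thing: grompp reports the
reciprocal-space grid it ends up using for a given fourierspacing/rcoulomb
pair, so the mesh/cut-off relationship can be verified before running. The
file names below are only examples and the exact wording of the message may
differ between versions:

  # grompp prints the fourier grid it selects for the requested spacing
  grompp -f full01.mdp -c conf.gro -p topol.top -o full01.tpr 2>&1 | grep -i "fourier grid"
)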
However, after extensive testing I found that recompiling mvapich2
and GROMACS 4.0.5 with Intel icc (version 11.1-046) did the trick.
We now obtain around 40-50% better performance per core (even on a
single core!), and the scaling problems are gone. We can now scale the
previously reported system to 32 cores quite nicely, which contrasts
with the earlier failures to scale to 12, 11 or even 8 cores with PME.
Note that I also tried gcc (4.3.2-7) with optimizations (-O3, -march
and others), but I could not reach the performance of icc.
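For anyone who wants to reproduce the build, the general recipe is along
these lines. This is only a sketch: the prefixes, versions and program
suffix are examples and should be adapted to your site, and FFTW and other
site-specific flags are omitted:

  # 1) build mvapich2 with the Intel compilers
  cd mvapich2-1.2
  ./configure --prefix=/opt/mvapich2-intel CC=icc CXX=icpc F77=ifort F90=ifort
  make && make install

  # 2) build GROMACS 4.0.5 against that MPI, also with icc
  cd ../gromacs-4.0.5
  export PATH=/opt/mvapich2-intel/bin:$PATH
  ./configure --prefix=/opt/gromacs-4.0.5-intel --enable-mpi \
              --program-suffix=-mvapich2 CC=icc F77=ifort MPICC=mpicc
  make && make install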
To compare with the previously posted numbers, I post below the new
numbers from reruns of the same system with the new build on 4, 8, 16
and 32 cores; please comment. Also note that in some runs the PP/PME
node split chosen for the total number of cores is not ideal for
scaling, but those splits are mdrun's automatic guesses, and I can
definitely improve on them by specifying the number of PME nodes
explicitly, as in the example below.
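For instance, to override mdrun's guess and force a given PP/PME split,
something along these lines should work (the value passed to -npme is only
an illustration and has to be tuned for the system and core count):

  # same job as below, but with an explicit number of PME-only nodes
  srun -n 32 mdrun-mvapich2 -npme 8 -v -dlb yes -deffnm full01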
BTW: I found some fourier-spacing/cut-off combinations that simply
lead to an IMMEDIATE crash of my system (fourierspacing/cut-off: 0.13/1.0;
0.13/1.1; 0.135/1.1; 0.16/1.2; 0.16/1.2), while other combinations run
stably for at least 40 ns (fourierspacing/cut-off: 0.12/1.0; 0.14/1.0;
0.14/1.1; 0.14/1.2; 0.15/1.2; 0.15/1.3). This appears to be related more
to the fourier-spacing than to the cut-offs. With the failing combinations,
mdrun complains about particles being more than a "cell length out of the
domain decomposition cell of their charge group" and/or that it cannot
settle some water(s). Is this behavior normal, and if it is, what is the
cause of these fourier-spacing/cut-off failures? Thanks.
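In case it helps anyone reproduce the screen, a loop of this kind can be
used to run a short test for each fourierspacing/cut-off pair (the file
names and the pairs listed are only examples):

  # run a short test for several fourierspacing/cut-off pairs
  for pair in "0.12 1.0" "0.14 1.1" "0.16 1.2"; do
      set -- $pair; fs=$1; rc=$2
      sed -e "s/^fourierspacing.*/fourierspacing = $fs/" \
          -e "s/^rlist.*/rlist = $rc/" \
          -e "s/^rcoulomb.*/rcoulomb = $rc/" \
          -e "s/^rvdw.*/rvdw = $rc/" full01.mdp > test_${fs}_${rc}.mdp
      grompp -f test_${fs}_${rc}.mdp -c conf.gro -p topol.top -o test_${fs}_${rc}.tpr
      srun -n 32 mdrun-mvapich2 -dlb yes -deffnm test_${fs}_${rc}
  done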
Daniel Silva
_________
srun -n 32 mdrun-mvapich2 -v -dlb yes -deffnm full01
_________
4 cores:
Average load imbalance: 0.3 %
Part of the total run time spent waiting due to load imbalance: 0.2 %
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
R E A L C Y C L E A N D T I M E A C C O U N T I N G
Computing:              Nodes   Number   G-Cycles   Seconds      %
-----------------------------------------------------------------------
Domain decomp.              4      501     29.329      11.7    1.3
Vsite constr.               4     5001      1.957       0.8    0.1
Comm. coord.                4     5001     11.370       4.5    0.5
Neighbor search             4      501    117.072      46.8    5.0
Force                       4     5001   1409.451     563.0   60.4
Wait + Comm. F              4     5001     38.034      15.2    1.6
PME mesh                    4     5001    582.642     232.7   25.0
Vsite spread                4    10002      6.085       2.4    0.3
Write traj.                 4       14      0.954       0.4    0.0
Update                      4     5001     42.713      17.1    1.8
Constraints                 4     5001     61.328      24.5    2.6
Comm. energies              4     5001      2.005       0.8    0.1
Rest                        4              30.367      12.1    1.3
-----------------------------------------------------------------------
Total                       4            2333.306     932.0  100.0
-----------------------------------------------------------------------
Parallel run - timing based on wallclock.
NODE (s) Real (s) (%)
Time: 233.000 233.000 100.0
3:53
(Mnbf/s) (GFlops) (ns/day) (hour/ns)
Performance: 276.468 15.263 9.272 2.588
_________
8 cores:
Average load imbalance: 0.6 %
Part of the total run time spent waiting due to load imbalance: 0.3 %
Steps where the load balancing was limited by -rdd, -rcon and/or
-dds: X 0 % Y 0 %
R E A L C Y C L E A N D T I M E A C C O U N T I N G
Computing:              Nodes   Number   G-Cycles   Seconds      %
-----------------------------------------------------------------------
Domain decomp.              8      501     44.609      17.8    1.8
Vsite constr.               8     5001      4.122       1.6    0.2
Comm. coord.                8     5001     32.280      12.9    1.3
Neighbor search             8      501    127.789      51.0    5.1
Force                       8     5001   1393.396     556.0   55.2
Wait + Comm. F              8     5001     71.092      28.4    2.8
PME mesh                    8     5001    694.855     277.3   27.5
Vsite spread                8    10002      7.755       3.1    0.3
Write traj.                 8       14      1.060       0.4    0.0
Update                      8     5001     43.490      17.4    1.7
Constraints                 8     5001     70.224      28.0    2.8
Comm. energies              8     5001      3.471       1.4    0.1
Rest                        8              31.993      12.8    1.3
-----------------------------------------------------------------------
Total                       8            2526.134    1008.0  100.0
-----------------------------------------------------------------------
Parallel run - timing based on wallclock.
NODE (s) Real (s) (%)
Time: 126.000 126.000 100.0
2:06
(Mnbf/s) (GFlops) (ns/day) (hour/ns)
Performance: 511.563 28.228 17.146 1.400
Finished mdrun on node 0 Thu Oct 1 20:50:28 2009
_________
16 cores:
Average load imbalance: 0.8 %
Part of the total run time spent waiting due to load imbalance: 0.6 %
Steps where the load balancing was limited by -rdd, -rcon and/or
-dds: X 0 % Y 0 %
Average PME mesh/force load: 0.660
Part of the total run time spent waiting due to PP/PME imbalance: 10.3 %
NOTE: 10.3 % performance was lost because the PME nodes
had less work to do than the PP nodes.
You might want to decrease the number of PME nodes
or decrease the cut-off and the grid spacing.
R E A L C Y C L E A N D T I M E A C C O U N T I N G
Computing:              Nodes   Number   G-Cycles   Seconds      %
-----------------------------------------------------------------------
Domain decomp.             10      501     43.668      17.5    1.5
Vsite constr.              10     5001      2.534       1.0    0.1
Send X to PME              10     5001      6.808       2.7    0.2
Comm. coord.               10     5001     35.786      14.3    1.2
Neighbor search            10      501    124.724      50.0    4.2
Force                      10     5001   1384.184     554.6   46.8
Wait + Comm. F             10     5001     74.121      29.7    2.5
PME mesh                    6     5001    584.282     234.1   19.8
Wait + Comm. X/F            6     5001    523.711     209.9   17.7
Wait + Recv. PME F         10     5001      4.567       1.8    0.2
Vsite spread               10    10002      7.531       3.0    0.3
Write traj.                10       14      1.177       0.5    0.0
Update                     10     5001     42.708      17.1    1.4
Constraints                10     5001     66.270      26.6    2.2
Comm. energies             10     5001     20.971       8.4    0.7
Rest                       10              31.757      12.7    1.1
-----------------------------------------------------------------------
Total                      16            2954.800    1184.0  100.0
-----------------------------------------------------------------------
Parallel run - timing based on wallclock.
NODE (s) Real (s) (%)
Time: 74.000 74.000 100.0
1:14
(Mnbf/s) (GFlops) (ns/day) (hour/ns)
Performance: 871.450 45.941 29.195 0.822
_________
32 cores:
Average load imbalance: 1.4 %
Part of the total run time spent waiting due to load imbalance: 1.0 %
Steps where the load balancing was limited by -rdd, -rcon and/or
-dds: X 0 % Y 0 %
Average PME mesh/force load: 0.901
Part of the total run time spent waiting due to PP/PME imbalance: 2.9 %
R E A L C Y C L E A N D T I M E A C C O U N T I N G
Computing:              Nodes   Number   G-Cycles   Seconds      %
-----------------------------------------------------------------------
Domain decomp.             20      501     67.311      27.2    2.1
Vsite constr.              20     5001      4.761       1.9    0.1
Send X to PME              20     5001      6.982       2.8    0.2
Comm. coord.               20     5001     76.154      30.8    2.3
Neighbor search            20      501    130.554      52.8    4.0
Force                      20     5001   1381.435     559.2   42.6
Wait + Comm. F             20     5001    138.266      56.0    4.3
PME mesh                   12     5001    836.239     338.5   25.8
Wait + Comm. X/F           12     5001    379.167     153.5   11.7
Wait + Recv. PME F         20     5001      3.267       1.3    0.1
Vsite spread               20    10002     14.383       5.8    0.4
Write traj.                20       14      1.580       0.6    0.0
Update                     20     5001     42.473      17.2    1.3
Constraints                20     5001     76.153      30.8    2.3
Comm. energies             20     5001     50.078      20.3    1.5
Rest                       20              32.541      13.2    1.0
-----------------------------------------------------------------------
Total                      32            3241.342    1312.0  100.0
-----------------------------------------------------------------------
Parallel run - timing based on wallclock.
NODE (s) Real (s) (%)
Time: 41.000 41.000 100.0
(Mnbf/s) (GFlops) (ns/day) (hour/ns)
Performance: 1572.657 82.914 52.693 0.455
_________
MDP:
integrator = md
dt = 0.005
nsteps = 5000
pbc = xyz
nstlist = 10
rlist = 1.1
ns_type = grid
coulombtype = pme
rcoulomb = 1.1
vdwtype = cut-off
rvdw = 1.1
tcoupl = Berendsen
tc-grps = protein non-protein
tau-t = 0.1 0.1
ref-t = 318 318
Pcoupl = Berendsen
pcoupltype = isotropic
tau-p = 1.0
ref-p = 1.0
compressibility = 4.5e-5
fourierspacing = 0.14
pme_order = 4
optimize_fft = yes
ewald_rtol = 1e-5
gen_vel = yes
gen_temp = 318
gen_seed = 173529
constraints = all-bonds
constraint_algorithm = lincs
lincs_order = 4
nstxout = 400
nstvout = 4000
nstfout = 0
nstlog = 50
nstenergy = 50
energygrps = Protein non-protein
__________________
END of message
2009/9/4 Erik Lindahl <lindahl at cbr.su.se>:
> Hi,
>
> On Sep 3, 2009, at 4:52 AM, Daniel Adriano Silva M wrote:
>
>> Dear Gromacs users, (all related to GROMACS ver 4.0.x)
>>
>> I am facing a very strange problem on some recently acquired Supermicro
>> nodes with 8 XEON cores each (2.5 GHz quad-core XEON E5420, 4 GB RAM
>> with the four memory channels activated, 20 Gb/s Infiniband Infinihost
>> III Lx DDR): I have been testing these nodes with one of our most
>> familiar protein models (49887 atoms: 2873 for the protein and the rest
>> water in a dodecahedron cell), which I know scales almost linearly up
>> to 32 cores on a 2.4 GHz quad-core/node Opteron cluster.
>
> Without going deeper into the rest of the discussion, note that the
> E5420 isn't a real quad-core, but a multi-chip module with two dual-cores
> connected by Intel's old/slow front-side bus.
>
> In particular, this means all communication and memory operations have to
> share the narrow bus. Since PME involves more memory IO (charge
> spreading/interpolation) I'm not entirely surprised if the relative PME
> scaling doesn't match the direct space scaling. I don't think I've *ever*
> seen perfect scaling on these chips.
>
>
> The point of separate PME nodes is mainly to improve the high end scaling,
> since it reduces the number of MPI calls significantly. However, for the
> same reason it can obviously lead to load imbalance issues with fewer
> processors. You can always turn it off manually - the 12-cpu limit is very
> much heuristic.
>
> Finally, it will be virtually impossible to load balance effectively over
> e.g. 11 CPUs in your cluster. Remember, there are at least three different
> latency levels (cores on the same chip, cores on different chips in the same
> node, cores on different nodes), and all processes running on a node share
> the IB host adapter. Stick to multiples of 8 and try to have even sizes both
> for your direct space decomposition as well as the reciprocal space grid.
>
> Cheers,
>
> Erik
>