[gmx-users] Scaling problems in 8-cores nodes with GROMACS 4.0x
Daniel Adriano Silva M
dadriano at gmail.com
Thu Sep 3 04:52:47 CEST 2009
Dear GROMACS users (all of this relates to GROMACS 4.0.x),
I am facing a very strange problem on our recently acquired Supermicro nodes
with 8 XEON cores each (2.5 GHz quad-core XEON E5420, 4 GB of RAM with all
four memory channels active, 20 Gb/s InfiniBand InfiniHost III Lx DDR). I have
been testing these nodes with one of our most familiar protein models (49887
atoms: 2873 for the protein and the rest for water in a dodecahedron cell),
which I know scales almost linearly up to 32 cores on a 2.4 GHz quad-core
Opteron cluster. On the new nodes, however, I see severe PME/PP load
imbalance (20% and up). At first I thought the problem was related to
InfiniBand latency, but a recent test gave me a big surprise: since my model
scales very well to 8 cores, I spread it over 8 cores on four machines and
the performance was the same as on a single node, which in turn suggests
the cause is something other than latency. After several tests I realized
that the problem arises whenever the run is split into separate PME and PP
nodes, even within a single node.
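(As far as I understand, this split is controlled by mdrun's -npme option:
-npme 0 disables the separate PME nodes, a positive value forces that many
PME-only nodes, and with the default of -1 mdrun guesses on its own.
Schematically, with the same binary and files as in the runs below:)

# all 8 processes do both PP and PME (no separate PME nodes):
srun -n8 /home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun -v -dlb yes -npme 0 -deffnm FULL01/full01

# 3 of the 8 processes do only PME, the other 5 do PP:
srun -n8 /home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun -v -dlb yes -npme 3 -deffnm FULL01/full01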
For example, if for a short job I run (it is exactly the same for a long run):
srun -n8 /home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun
-v -dlb yes -deffnm FULL01/full01
Average load imbalance: 0.7 %
Part of the total run time spent waiting due to load imbalance: 0.2 %
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 %
R E A L C Y C L E A N D T I M E A C C O U N T I N G
Computing: Nodes Number G-Cycles Seconds %
-----------------------------------------------------------------------
Domain decomp. 8 101 19.123 7.6 2.5
Vsite constr. 8 1001 2.189 0.9 0.3
Comm. coord. 8 1001 5.810 2.3 0.8
Neighbor search 8 101 51.432 20.4 6.7
Force 8 1001 250.938 99.5 32.7
Wait + Comm. F 8 1001 15.064 6.0 2.0
PME mesh 8 1001 337.946 133.9 44.1
Vsite spread 8 2002 2.991 1.2 0.4
Write traj. 8 2 0.604 0.2 0.1
Update 8 1001 17.854 7.1 2.3
Constraints 8 1001 35.782 14.2 4.7
Comm. energies 8 1001 1.407 0.6 0.2
Rest 8 25.889 10.3 3.4
-----------------------------------------------------------------------
Total 8 767.030 304.0 100.0
-----------------------------------------------------------------------
Parallel run - timing based on wallclock.
NODE (s) Real (s) (%)
Time: 38.000 38.000 100.0
(Mnbf/s) (GFlops) (ns/day) (hour/ns)
Performance: 254.161 14.534 11.380 2.109
which reflects that there is no separation between PME and PP nodes, and the
scaling is almost linear compared with 1 processor. But if I force separate
PME nodes, using exactly the same total number of processors:
srun -n8 --cpu_bind=rank
/home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun -v
-dlb yes -npme 3 -deffnm FULL01/full01
Average load imbalance: 0.5 %
Part of the total run time spent waiting due to load imbalance: 0.2 %
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
Average PME mesh/force load: 1.901
Part of the total run time spent waiting due to PP/PME imbalance: 23.9 %
NOTE: 23.9 % performance was lost because the PME nodes
had more work to do than the PP nodes.
You might want to increase the number of PME nodes
or increase the cut-off and the grid spacing.
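(If I understand that NOTE correctly, the change it suggests would look
roughly like this in my mdp, scaling the cut-offs and the grid spacing by the
same factor so that the rcoulomb/fourierspacing ratio, and hence the PME
accuracy, stays about the same; the 1.2 factor is only an example I have been
playing with, not a recommendation:)

rlist            = 1.2    ; was 1.0
rcoulomb         = 1.2    ; was 1.0
rvdw             = 1.2    ; was 1.0
fourierspacing   = 0.192  ; was 0.16, scaled by the same 1.2 factor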
R E A L C Y C L E A N D T I M E A C C O U N T I N G
Computing: Nodes Number G-Cycles Seconds %
-----------------------------------------------------------------------
Domain decomp. 5 101 14.660 5.9 1.4
Vsite constr. 5 1001 1.440 0.6 0.1
Send X to PME 5 1001 4.601 1.9 0.5
Comm. coord. 5 1001 3.229 1.3 0.3
Neighbor search 5 101 48.143 19.4 4.8
Force 5 1001 252.340 101.8 25.0
Wait + Comm. F 5 1001 8.845 3.6 0.9
PME mesh 3 1001 304.447 122.9 30.1
Wait + Comm. X/F 3 1001 73.389 29.6 7.3
Wait + Recv. PME F 5 1001 219.552 88.6 21.7
Vsite spread 5 2002 3.828 1.5 0.4
Write traj. 5 2 0.555 0.2 0.1
Update 5 1001 17.765 7.2 1.8
Constraints 5 1001 31.203 12.6 3.1
Comm. energies 5 1001 1.977 0.8 0.2
Rest 5 25.105 10.1 2.5
-----------------------------------------------------------------------
Total 8 1011.079 408.0 100.0
-----------------------------------------------------------------------
Parallel run - timing based on wallclock.
NODE (s) Real (s) (%)
Time: 51.000 51.000 100.0
(Mnbf/s) (GFlops) (ns/day) (hour/ns)
Performance: 189.377 10.354 8.479 2.831
As you can see, I get very bad performance. The same happens if I do not
specify the number of PME nodes and spread the job over 11 processors (and it
gets worse with more processors), which gives:
srun -n11 --cpu_bind=rank
/home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun -v
-dlb yes -deffnm FULL01/full01
NOTE: 11.9 % performance was lost because the PME nodes
had more work to do than the PP nodes.
You might want to increase the number of PME nodes
or increase the cut-off and the grid spacing.
R E A L C Y C L E A N D T I M E A C C O U N T I N G
Computing: Nodes Number G-Cycles Seconds %
-----------------------------------------------------------------------
Domain decomp. 6 101 15.450 6.2 1.6
Vsite constr. 6 1001 1.486 0.6 0.2
Send X to PME 6 1001 1.154 0.5 0.1
Comm. coord. 6 1001 3.832 1.5 0.4
Neighbor search 6 101 47.950 19.1 5.1
Force 6 1001 250.202 99.7 26.7
Wait + Comm. F 6 1001 10.022 4.0 1.1
PME mesh 5 1001 314.841 125.5 33.6
Wait + Comm. X/F 5 1001 111.565 44.5 11.9
Wait + Recv. PME F 6 1001 102.240 40.8 10.9
Vsite spread 6 2002 2.317 0.9 0.2
Write traj. 6 2 0.567 0.2 0.1
Update 6 1001 17.849 7.1 1.9
Constraints 6 1001 31.215 12.4 3.3
Comm. energies 6 1001 2.274 0.9 0.2
Rest 6 25.283 10.1 2.7
-----------------------------------------------------------------------
Total 11 938.249 374.0 100.0
-----------------------------------------------------------------------
Parallel run - timing based on wallclock.
NODE (s) Real (s) (%)
Time: 34.000 34.000 100.0
(Mnbf/s) (GFlops) (ns/day) (hour/ns)
Performance: 284.388 15.963 12.719 1.887
I have tried everything that came to mind, from modifying -npme and the CPU
affinity to the mdp cut-offs and fourierspacing, as well as recompiling and
trying different versions of FFTW. Please advise me with any ideas, tests to
try, or tips (the kind of -npme scan I have been running is sketched after
the mdp options below). The mdp options for these runs were:
integrator = md
dt = 0.005
nsteps = 1000
pbc = xyz
nstlist = 10
rlist = 1.0
ns_type = grid
coulombtype = pme
rcoulomb = 1.0
vdwtype = cut-off
rvdw = 1.0
tcoupl = Berendsen
tc-grps = protein non-protein
tau-t = 0.1 0.1
ref-t = 318 318
Pcoupl = Berendsen
pcoupltype = isotropic
tau-p = 1.0
ref-p = 1.0
compressibility = 4.5e-5
fourierspacing = 0.16
pme_order = 4
optimize_fft = yes
ewald_rtol = 1e-5
gen_vel = yes
gen_temp = 318
gen_seed = 173529
constraints = all-bonds
constraint_algorithm = lincs
lincs_order = 4
nstxout = 5000
nstvout = 5000
nstfout = 0
nstlog = 5000
nstenergy = 5000
energygrps = Protein non-protein
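For completeness, this is roughly the kind of -npme scan I have been running
(a minimal sketch only; the _npme suffix in the output names is just so each
run keeps its own log):

# try a few -npme values on 8 cores and compare the resulting performance
for NPME in 0 2 3 4; do
    srun -n8 --cpu_bind=rank \
        /home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun \
        -v -dlb yes -npme $NPME \
        -s FULL01/full01.tpr -deffnm FULL01/full01_npme${NPME}
done
grep -H "Performance:" FULL01/full01_npme*.log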
Thanks.
Daniel Silva