[gmx-users] Poor load balancing
Deniz KARASU
karasudeniz at gmail.com
Mon Feb 15 15:13:05 CET 2010
Hi All,
I'm trying to run the d.lzm GROMACS benchmark on a 64-node machine, but the
dynamic load balancing performance is very poor.
Any suggestions would be of great help.
Thanks.
Deniz KARASU
Log file opened on Sat Feb 13 17:23:37 2010
Host: d077.uybhm.itu.edu.tr pid: 20157 nodeid: 0 nnodes: 64
The Gromacs distribution was built Thu Sep 10 11:45:26 EEST 2009 by
mds.fatma at lnode1.uybhm.itu.edu.tr (Linux
2.6.18-53.1.14.el5_lustre.1.6.5.1smp x86_64)
:-) G R O M A C S (-:
Good ROcking Metal Altar for Chronical Sinners
:-) VERSION 4.0.5 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2008, The GROMACS development team,
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-) /AKDENIZ/HOME005/users/mds.fatma/rs/software/bin/mdrun (-:
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------
parameters of the run:
integrator = md
nsteps = 5000
init_step = 0
ns_type = Grid
nstlist = 5
ndelta = 2
nstcomm = 1
comm_mode = Linear
nstlog = 0
nstxout = 0
nstvout = 0
nstfout = 0
nstenergy = 0
nstxtcout = 0
init_t = 0
delta_t = 0.004
xtcprec = 1000
nkx = 0
nky = 0
nkz = 0
pme_order = 4
ewald_rtol = 1e-05
ewald_geometry = 0
epsilon_surface = 0
optimize_fft = FALSE
ePBC = xyz
bPeriodicMols = FALSE
bContinuation = FALSE
bShakeSOR = FALSE
etc = Berendsen
epc = No
epctype = Isotropic
tau_p = 1
ref_p (3x3):
ref_p[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref_p[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref_p[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compress (3x3):
compress[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compress[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compress[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
refcoord_scaling = No
posres_com (3):
posres_com[0]= 0.00000e+00
posres_com[1]= 0.00000e+00
posres_com[2]= 0.00000e+00
posres_comB (3):
posres_comB[0]= 0.00000e+00
posres_comB[1]= 0.00000e+00
posres_comB[2]= 0.00000e+00
andersen_seed = 815131
rlist = 0.9
rtpi = 0.05
coulombtype = Cut-off
rcoulomb_switch = 0
rcoulomb = 1.4
vdwtype = Cut-off
rvdw_switch = 0
rvdw = 1.4
epsilon_r = 1
epsilon_rf = 1
tabext = 1
implicit_solvent = No
gb_algorithm = Still
gb_epsilon_solvent = 80
nstgbradii = 1
rgbradii = 2
gb_saltconc = 0
gb_obc_alpha = 1
gb_obc_beta = 0.8
gb_obc_gamma = 4.85
sa_surface_tension = 2.092
DispCorr = No
free_energy = no
init_lambda = 0
sc_alpha = 0
sc_power = 0
sc_sigma = 0.3
delta_lambda = 0
nwall = 0
wall_type = 9-3
wall_atomtype[0] = -1
wall_atomtype[1] = -1
wall_density[0] = 0
wall_density[1] = 0
wall_ewald_zfac = 3
pull = no
disre = No
disre_weighting = Conservative
disre_mixed = FALSE
dr_fc = 1000
dr_tau = 0
nstdisreout = 100
orires_fc = 0
orires_tau = 0
nstorireout = 100
dihre-fc = 1000
em_stepsize = 0.01
em_tol = 10
niter = 20
fc_stepsize = 0
nstcgsteep = 1000
nbfgscorr = 10
ConstAlg = Lincs
shake_tol = 0.0001
lincs_order = 4
lincs_warnangle = 30
lincs_iter = 1
bd_fric = 0
ld_seed = 1993
cos_accel = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
grpopts:
nrdf: 2636.83 23.9984 42933.2
ref_t: 300 300 300
tau_t: 0.1 0.1 0.1
anneal: No No No
ann_npoints: 0 0 0
acc: 0 0 0
nfreeze: N N N
energygrp_flags[ 0]: 0
efield-x:
n = 0
efield-xt:
n = 0
efield-y:
n = 0
efield-yt:
n = 0
efield-z:
n = 0
efield-zt:
n = 0
bQMMM = FALSE
QMconstraints = 0
QMMMscheme = 0
scalefactor = 1
qm_opts:
ngQM = 0
Initializing Domain Decomposition on 64 nodes
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
two-body bonded interactions: 0.571 nm, LJ-14, atoms 439 442
multi-body bonded interactions: 0.571 nm, Proper Dih., atoms 439 442
Minimum cell size due to bonded interactions: 0.628 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.825 nm
Estimated maximum distance required for P-LINCS: 0.825 nm
This distance will limit the DD cell size, you can override this with -rcon
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 64 cells with a minimum initial size of 1.031 nm
The maximum allowed number of cells is: X 5 Y 5 Z 4
Domain decomposition grid 4 x 4 x 4, separate PME nodes 0
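As a rough cross-check of the grid choice above, the sketch below recomputes the minimum initial cell size (the 0.825 nm P-LINCS distance scaled by 1/0.8) and, under a simplified rule, the per-dimension cell limits, then lists every 64-cell grid that fits. The box edges are an assumption: this excerpt does not print the box, so they are estimated as 4 times the initial cell sizes reported further down (1.43, 1.43 and 1.24 nm). The real mdrun grid optimizer also weighs communication cost, so this is only an illustration.

```python
import math
from itertools import product

# Values copied from the log.
plincs_distance = 0.825   # nm, "Estimated maximum distance required for P-LINCS"
dds = 0.8                 # -dds: the initial minimum size is scaled by 1/dds
min_initial_size = plincs_distance / dds  # about 1.031 nm, as logged

# Assumption: the box is not printed in this excerpt, so each edge is
# estimated as 4 x the initial DD cell size reported later in the log.
box = {"X": 4 * 1.43, "Y": 4 * 1.43, "Z": 4 * 1.24}

# Simplified rule (assumption): at most floor(edge / min size) cells per dimension.
max_cells = {dim: math.floor(edge / min_initial_size) for dim, edge in box.items()}

# Every nx x ny x nz grid with exactly 64 cells that respects those limits.
n_nodes = 64
grids = [
    (nx, ny, nz)
    for nx, ny, nz in product(range(1, n_nodes + 1), repeat=3)
    if nx * ny * nz == n_nodes
    and nx <= max_cells["X"]
    and ny <= max_cells["Y"]
    and nz <= max_cells["Z"]
]

print(round(min_initial_size, 3), max_cells, grids)
# -> 1.031 {'X': 5, 'Y': 5, 'Z': 4} [(4, 4, 4)]
```

Under these assumptions 4 x 4 x 4 is the only factorization of 64 within the X 5, Y 5, Z 4 limits, which is consistent with the grid mdrun picked.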
Domain decomposition nodeid 0, coordinates 0 0 0
Using two step summing over 11 groups of on average 5.8 processes
Table routines are used for coulomb: FALSE
Table routines are used for vdw: FALSE
Cut-off's: NS: 0.9 Coulomb: 1.4 LJ: 1.4
System total charge: 0.000
Generated table with 1200 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1200 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1200 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Enabling SPC water optimization for 7156 molecules.
Configuring nonbonded kernels...
Testing x86_64 SSE support... present.
Removing pbc first time
Initializing Parallel LINear Constraint Solver
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- --- Thank You --- -------- --------
The number of constraints is 1407
There are inter charge-group constraints,
will communicate selected coordinates each lincs iteration
117 constraints are involved in constraint triangles,
will apply an additional matrix expansion of order 4 for couplings
between constraints inside triangles
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------
Linking all bonded interactions to atoms
There are 379 inter charge-group virtual sites,
will an extra communication step for selected coordinates and forces
The initial number of communication pulses is: X 1 Y 1 Z 2
The initial domain decomposition cell size is: X 1.43 nm Y 1.43 nm Z 1.24 nm
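The initial pulse counts line up with the cell sizes if one assumes each dimension needs enough communication pulses for the halo to span the 1.4 nm cut-off, i.e. ceil(cut-off / cell size). This is a simplified reading of the log, not the exact mdrun code:

```python
import math

cutoff = 1.4  # nm, rcoulomb = rvdw = 1.4 in the run parameters
initial_cell = {"X": 1.43, "Y": 1.43, "Z": 1.24}  # nm, initial DD cell sizes

# Assumed rule: enough pulses that the communicated halo covers the cut-off.
pulses = {dim: math.ceil(cutoff / size) for dim, size in initial_cell.items()}
print(pulses)  # {'X': 1, 'Y': 1, 'Z': 2}
```

Z needs a second pulse only because its 1.24 nm cells are shorter than the 1.4 nm cut-off.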
The maximum allowed distance for charge groups involved in interactions is:
non-bonded interactions 1.400 nm
two-body bonded interactions (-rdd) 1.400 nm
multi-body bonded interactions (-rdd) 1.239 nm
virtual site constructions (-rcon) 1.239 nm
atoms separated by up to 5 constraints (-rcon) 1.239 nm
When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 2 Y 2 Z 2
The minimum size for domain decomposition cells is 0.905 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.63 Y 0.63 Z 0.73
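The allowed shrink factors printed above appear to be the 0.905 nm minimum cell size divided by the initial cell sizes, and the maximum of 2 pulses per dimension follows from the same ceiling rule applied to a cell at that minimum size. A sketch of that arithmetic, with the relation itself labeled as an assumption:

```python
import math

cutoff = 1.4        # nm
min_cell = 0.905    # nm, minimum DD cell size once DLB is on
initial_cell = {"X": 1.43, "Y": 1.43, "Z": 1.24}  # nm

# Assumed relation: allowed shrink = minimum cell size / initial cell size.
shrink = {dim: round(min_cell / size, 2) for dim, size in initial_cell.items()}

# A cell shrunk all the way to min_cell needs this many pulses for the cut-off.
max_pulses = math.ceil(cutoff / min_cell)

print(shrink, max_pulses)  # {'X': 0.63, 'Y': 0.63, 'Z': 0.73} 2
```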
The maximum allowed distance for charge groups involved in interactions is:
non-bonded interactions 1.400 nm
two-body bonded interactions (-rdd) 1.400 nm
multi-body bonded interactions (-rdd) 0.905 nm
virtual site constructions (-rcon) 0.905 nm
atoms separated by up to 5 constraints (-rcon) 0.905 nm
Making 3D domain decomposition grid 4 x 4 x 4, home cell index 0 0 0
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: rest
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, J. P. M. Postma, A. DiNola and J. R. Haak
Molecular dynamics with coupling to an external bath
J. Chem. Phys. 81 (1984) pp. 3684-3690
-------- -------- --- Thank You --- -------- --------
There are: 22824 Atoms
There are: 383 VSites
Charge group distribution at step 0: 119 121 124 123 127 118 128 126 117 112
124 118 126 120 130 120 121 136 124 123 118 117 125 130 122 129 127 123 125
125 113 119 124 127 124 124 123 119 128 129 123 128 126 121 119 124 118 129
131 118 119 119 122 128 129 124 121 123 125 120 120 120 116 131
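To quantify how even the initial split is, the charge-group counts above can be summarized. The list is copied from the log; the interpretation afterwards is a hedged reading, not something the log states.

```python
# Charge groups per node at step 0, copied from the log (64 values).
cg_per_node = [
    119, 121, 124, 123, 127, 118, 128, 126, 117, 112,
    124, 118, 126, 120, 130, 120, 121, 136, 124, 123, 118, 117, 125, 130, 122, 129, 127, 123, 125,
    125, 113, 119, 124, 127, 124, 124, 123, 119, 128, 129, 123, 128, 126, 121, 119, 124, 118, 129,
    131, 118, 119, 119, 122, 128, 129, 124, 121, 123, 125, 120, 120, 120, 116, 131,
]

avg = sum(cg_per_node) / len(cg_per_node)
# Largest-to-mean ratio of the counts (not of the work per node).
count_ratio = max(cg_per_node) / avg

print(len(cg_per_node), min(cg_per_node), max(cg_per_node), round(avg, 1), round(count_ratio, 3))
```

The counts stay within roughly 10 % of the mean (112 to 136 per node), so the 262 % force imbalance reported at step 4 presumably reflects unequal work per charge group (for example protein versus SPC water) rather than unequal counts; the log alone does not confirm this.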
Grid: 6 x 6 x 5 cells
Constraining the starting coordinates (step 0)
Constraining the coordinates at t0-dt (step 0)
RMS relative constraint deviation after constraining: 3.57e-05
Initial temperature: 311.264 K
Started mdrun on node 0 Sat Feb 13 17:23:39 2010
           Step           Time         Lambda
              0        0.00000        0.00000
Energies (kJ/mol)
       G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
    2.20938e+03    1.06206e+03    5.21012e+02    5.34001e+02    1.67617e+04
        LJ (SR)        LJ (LR)   Coulomb (SR)   Coulomb (LR)      Potential
    4.37552e+04   -1.85437e+03   -3.77685e+05   -2.78734e+03   -3.17483e+05
    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
    5.90556e+04   -2.58428e+05    3.11564e+02    1.98804e+02    3.56693e-05
DD step 4 load imb.: force 262.0%
At step 5 the performance loss due to force load imbalance is 19.1 %
NOTE: Turning on dynamic load balancing
DD load balancing is limited by minimum cell size in dimension Y Z
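The two figures logged at steps 4 and 5 can be related if one assumes that the force imbalance is the slowest node's force time divided by the mean, minus one, and that the performance loss is the time the average node waits for the slowest one, as a fraction of the step. The sketch below implements those assumed definitions on hypothetical per-node timings, then backs out the mean force share of a step implied by 262.0 % and 19.1 %.

```python
def force_imbalance(force_times):
    """Assumed definition: slowest node's force time / mean force time - 1."""
    avg = sum(force_times) / len(force_times)
    return max(force_times) / avg - 1.0


def perf_loss(force_times, step_time):
    """Assumed definition: mean wait for the slowest node, as a fraction of the step."""
    avg = sum(force_times) / len(force_times)
    return (max(force_times) - avg) / step_time


# Hypothetical force times (ms) on 4 nodes, one of them overloaded.
toy_times = [1.0, 1.0, 1.0, 3.0]
print(force_imbalance(toy_times), perf_loss(toy_times, step_time=10.0))

# Under these definitions loss = imbalance * (mean force time / step time),
# so the logged values imply this mean force share of a step:
implied_force_share = 0.191 / 2.620
print(round(implied_force_share, 3))
```

If the assumed definitions hold, a 262 % imbalance costs only about 19 % because the force calculation is a small fraction (about 7 %) of each step at that point.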
DD step 4999 vol min/aver 0.453! load imb.: force 42.2%
           Step           Time         Lambda
           5000       20.00000        0.00000
Writing checkpoint, step 5000 at Sat Feb 13 17:23:57 2010
Energies (kJ/mol)
       G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
    2.18559e+03    1.08758e+03    5.08072e+02    5.73181e+02    1.67070e+04
        LJ (SR)        LJ (LR)   Coulomb (SR)   Coulomb (LR)      Potential
    4.39756e+04   -1.84574e+03   -3.78315e+05   -9.08535e+03   -3.24209e+05
    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
    5.81564e+04   -2.66053e+05    3.06820e+02   -3.44878e+02    9.68320e-05
<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>
Energies (kJ/mol)
       G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
    2.13937e+03    1.08823e+03    4.88467e+02    5.56312e+02    1.66991e+04
        LJ (SR)        LJ (LR)   Coulomb (SR)   Coulomb (LR)      Potential
    4.37569e+04   -1.85173e+03   -3.78660e+05   -7.85919e+03   -3.23642e+05
    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
    5.84560e+04   -2.65186e+05    3.08400e+02   -2.56636e+02    0.00000e+00
Total Virial (kJ/mol)
2.14238e+04 1.20840e+02 1.11414e+02
1.21134e+02 2.14442e+04 1.16878e+01
1.11918e+02 1.23263e+01 2.12292e+04
Pressure (bar)
-2.67468e+02 -2.17401e+01 -1.48656e+01
-2.17802e+01 -2.66730e+02 1.77342e-01
-1.49344e+01 9.02074e-02 -2.35709e+02
Total Dipole (Debye)
-3.97323e+02 -3.59815e+02 -1.52774e+02
T-Protein T-CL- T-SOL
2.99534e+02 3.00276e+02 3.08949e+02
<====== ############################### ==>
<==== R M S - F L U C T U A T I O N S ====>
<== ############################### ======>
Energies (kJ/mol)
       G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
    6.39796e+01    4.10873e+01    2.95910e+01    3.76420e+01    4.88986e+01
        LJ (SR)        LJ (LR)   Coulomb (SR)   Coulomb (LR)      Potential
    5.84609e+02    2.13849e+00    1.10640e+03    1.67778e+03    1.10444e+03
    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
    3.05395e+02    1.10173e+03    1.61119e+00    1.60301e+02    0.00000e+00
Total Virial (kJ/mol)
1.65615e+03 1.02322e+03 1.00778e+03
1.02179e+03 1.66559e+03 1.04738e+03
1.00766e+03 1.04676e+03 1.69082e+03
Pressure (bar)
2.28103e+02 1.41246e+02 1.39669e+02
1.41035e+02 2.27116e+02 1.45691e+02
1.39680e+02 1.45604e+02 2.31456e+02
Total Dipole (Debye)
3.19197e+02 1.87684e+02 1.24709e+02
T-Protein T-CL- T-SOL
5.84167e+00 7.10486e+01 1.65761e+00
M E G A - F L O P S A C C O U N T I N G
RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
NF=No Forces
Computing:                       M-Number      M-Flops % Flops
-----------------------------------------------------------------------
LJ                             480.457532    15855.099     1.9
Coulomb                        688.452307    18588.212     2.3
Coulomb [W3]                    66.644451     5331.556     0.6
Coulomb + LJ                   362.642477    13780.414     1.7
Coulomb + LJ [W3]              156.776518    14266.663     1.7
Coulomb + LJ [W3-W3]          2604.244668   638039.944    77.6
Outer nonbonded loop           930.259077     9302.591     1.1
1,4 nonbonded interactions      15.573114     1401.580     0.2
NS-Pairs                      3507.987455    73667.737     9.0
Reset In Box                     7.889882       23.670     0.0
CG-CoM                          23.253414       69.760     0.0
Angles                          11.487297     1929.866     0.2
Propers                          4.330866      991.768     0.1
Impropers                        2.730546      567.954     0.1
Virial                         130.461087     2348.300     0.3
Update                         116.058207     3597.804     0.4
Stop-CM                        116.058207     1160.582     0.1
Calc-Ekin                      116.081414     3134.198     0.4
Lincs                           15.846123      950.767     0.1
Lincs-Mat                      266.595492     1066.382     0.1
Constraint-V                   139.069412     1112.555     0.1
Constraint-Vir                 123.201821     2956.844     0.4
Settle                          35.801468    11563.874     1.4
Virtual Site 3                   0.140028        5.181     0.0
Virtual Site 3fd                 1.205241      114.498     0.0
Virtual Site 3fad                0.430086       75.695     0.0
Virtual Site 3out                0.140028       12.182     0.0
-----------------------------------------------------------------------
Total                                       821915.676   100.0
-----------------------------------------------------------------------
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 146675.8
av. #atoms communicated per step for vsites: 2 x 122.7
av. #atoms communicated per step for LINCS: 2 x 1993.1
Average load imbalance: 63.9 %
Part of the total run time spent waiting due to load imbalance: 6.8 %
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 19 % Z 19 %
NOTE: 6.8 % performance was lost due to load imbalance
in the domain decomposition.
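A back-of-envelope check, assuming the lost time is roughly the average imbalance multiplied by the share of the run spent in the force calculation (10.5 % in the cycle accounting table below):

```python
avg_imbalance = 0.639  # "Average load imbalance: 63.9 %"
force_share = 0.105    # "Force" row of the cycle accounting table: 10.5 %

# Rough estimate (assumption): the average node waits imbalance x its force time.
estimated_wait = avg_imbalance * force_share
print(f"{estimated_wait:.1%}")  # 6.7%, versus the 6.8 % reported
```

The close agreement suggests that a 64 % imbalance costs only about 7 % here because the force calculation itself is a small part of each step.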
R E A L C Y C L E A N D T I M E A C C O U N T I N G
Computing:        Nodes     Number     G-Cycles    Seconds       %
-----------------------------------------------------------------------
Domain decomp.       64       1001      542.428      228.8    19.9
Vsite constr.        64       5001       13.336        5.6     0.5
Comm. coord.         64       5001      201.623       85.0     7.4
Neighbor search      64       1001      468.949      197.8    17.2
Force                64       5001      286.647      120.9    10.5
Wait + Comm. F       64       5001      525.059      221.4    19.2
Vsite spread         64       5001       57.706       24.3     2.1
Write traj.          64          1        0.739        0.3     0.0
Update               64       5001       17.965        7.6     0.7
Constraints          64       5001      168.205       70.9     6.2
Comm. energies       64       5001      432.254      182.3    15.8
Rest                 64                  16.536        7.0     0.6
-----------------------------------------------------------------------
Total                64                2731.446     1152.0   100.0
-----------------------------------------------------------------------
NOTE: 16 % of the run time was spent communicating energies,
you might want to use the -nosum option of mdrun
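The cycle accounting table can be summarized the same way: the sketch below copies its percentage column, ranks the entries, and computes an optimistic Amdahl-style bound on the speed-up if energy communication cost nothing at all. The log does not show how much of that cost -nosum would actually remove in this setup, so treat the bound as a ceiling rather than a prediction.

```python
# Percentage column of the cycle accounting table, copied from the log.
shares = {
    "Domain decomp.": 19.9,
    "Vsite constr.": 0.5,
    "Comm. coord.": 7.4,
    "Neighbor search": 17.2,
    "Force": 10.5,
    "Wait + Comm. F": 19.2,
    "Vsite spread": 2.1,
    "Write traj.": 0.0,
    "Update": 0.7,
    "Constraints": 6.2,
    "Comm. energies": 15.8,
    "Rest": 0.6,
}

# The four largest contributors to the wall-clock time.
top4 = sorted(shares, key=shares.get, reverse=True)[:4]

# Optimistic Amdahl-style ceiling: speed-up if "Comm. energies" took no time.
max_speedup = 1.0 / (1.0 - shares["Comm. energies"] / 100.0)

print(top4)
print(f"force share: {shares['Force']} %, ceiling from removing energy comm.: {max_speedup:.2f}x")
```

The force calculation (10.5 %) is smaller than each of domain decomposition, waiting for force communication, neighbor searching and energy communication, the pattern one would expect when communication rather than computation dominates the step.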
Parallel run - timing based on wallclock.
               NODE (s)   Real (s)        (%)
       Time:     18.000     18.000      100.0
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:   1424.445     45.662     96.019      0.250
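The performance line can be reproduced from the run parameters (nsteps = 5000, delta_t = 0.004 ps), the 18.000 s wall-clock time and the M-Flops total; small differences come from the rounded time in the log.

```python
nsteps = 5000
dt_ps = 0.004              # delta_t in ps
wall_s = 18.000            # "Time:" row, wall-clock seconds (rounded in the log)
mflops_total = 821915.676  # "Total" row of the M-Flops accounting

simulated_ns = nsteps * dt_ps / 1000.0        # 20 ps = 0.02 ns
ns_per_day = simulated_ns / wall_s * 86400.0  # 86400 s per day
hours_per_ns = 24.0 / ns_per_day
gflops = mflops_total / wall_s / 1000.0

print(round(ns_per_day, 2), round(hours_per_ns, 3), round(gflops, 3))
```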
Finished mdrun on node 0 Sat Feb 13 17:23:57 2010