[gmx-users] Poor load balancing

Deniz KARASU karasudeniz at gmail.com
Mon Feb 15 15:13:05 CET 2010


Hi All,

I'm trying to run the d.lzm GROMACS benchmark on a 64-node machine, but
dynamic load balancing performance is very poor.

Any suggestions would be a great help.

Thanks.

Deniz KARASU

Log file opened on Sat Feb 13 17:23:37 2010
Host: d077.uybhm.itu.edu.tr  pid: 20157  nodeid: 0  nnodes:  64
The Gromacs distribution was built Thu Sep 10 11:45:26 EEST 2009 by
mds.fatma at lnode1.uybhm.itu.edu.tr (Linux 2.6.18-53.1.14.el5_lustre.1.6.5.1smp x86_64)


                         :-)  G  R  O  M  A  C  S  (-:

                 Good ROcking Metal Altar for Chronical Sinners

                            :-)  VERSION 4.0.5  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.

         This program is free software; you can redistribute it and/or
          modify it under the terms of the GNU General Public License
         as published by the Free Software Foundation; either version 2
             of the License, or (at your option) any later version.

        :-)  /AKDENIZ/HOME005/users/mds.fatma/rs/software/bin/mdrun  (-:


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------

parameters of the run:
   integrator           = md
   nsteps               = 5000
   init_step            = 0
   ns_type              = Grid
   nstlist              = 5
   ndelta               = 2
   nstcomm              = 1
   comm_mode            = Linear
   nstlog               = 0
   nstxout              = 0
   nstvout              = 0
   nstfout              = 0
   nstenergy            = 0
   nstxtcout            = 0
   init_t               = 0
   delta_t              = 0.004
   xtcprec              = 1000
   nkx                  = 0
   nky                  = 0
   nkz                  = 0
   pme_order            = 4
   ewald_rtol           = 1e-05
   ewald_geometry       = 0
   epsilon_surface      = 0
   optimize_fft         = FALSE
   ePBC                 = xyz
   bPeriodicMols        = FALSE
   bContinuation        = FALSE
   bShakeSOR            = FALSE
   etc                  = Berendsen
   epc                  = No
   epctype              = Isotropic
   tau_p                = 1
   ref_p (3x3):
      ref_p[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      ref_p[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      ref_p[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   compress (3x3):
      compress[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      compress[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      compress[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   refcoord_scaling     = No
   posres_com (3):
      posres_com[0]= 0.00000e+00
      posres_com[1]= 0.00000e+00
      posres_com[2]= 0.00000e+00
   posres_comB (3):
      posres_comB[0]= 0.00000e+00
      posres_comB[1]= 0.00000e+00
      posres_comB[2]= 0.00000e+00
   andersen_seed        = 815131
   rlist                = 0.9
   rtpi                 = 0.05
   coulombtype          = Cut-off
   rcoulomb_switch      = 0
   rcoulomb             = 1.4
   vdwtype              = Cut-off
   rvdw_switch          = 0
   rvdw                 = 1.4
   epsilon_r            = 1
   epsilon_rf           = 1
   tabext               = 1
   implicit_solvent     = No
   gb_algorithm         = Still
   gb_epsilon_solvent   = 80
   nstgbradii           = 1
   rgbradii             = 2
   gb_saltconc          = 0
   gb_obc_alpha         = 1
   gb_obc_beta          = 0.8
   gb_obc_gamma         = 4.85
   sa_surface_tension   = 2.092
   DispCorr             = No
   free_energy          = no
   init_lambda          = 0
   sc_alpha             = 0
   sc_power             = 0
   sc_sigma             = 0.3
   delta_lambda         = 0
   nwall                = 0
   wall_type            = 9-3
   wall_atomtype[0]     = -1
   wall_atomtype[1]     = -1
   wall_density[0]      = 0
   wall_density[1]      = 0
   wall_ewald_zfac      = 3
   pull                 = no
   disre                = No
   disre_weighting      = Conservative
   disre_mixed          = FALSE
   dr_fc                = 1000
   dr_tau               = 0
   nstdisreout          = 100
   orires_fc            = 0
   orires_tau           = 0
   nstorireout          = 100
   dihre-fc             = 1000
   em_stepsize          = 0.01
   em_tol               = 10
   niter                = 20
   fc_stepsize          = 0
   nstcgsteep           = 1000
   nbfgscorr            = 10
   ConstAlg             = Lincs
   shake_tol            = 0.0001
   lincs_order          = 4
   lincs_warnangle      = 30
   lincs_iter           = 1
   bd_fric              = 0
   ld_seed              = 1993
   cos_accel            = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   userint1             = 0
   userint2             = 0
   userint3             = 0
   userint4             = 0
   userreal1            = 0
   userreal2            = 0
   userreal3            = 0
   userreal4            = 0
grpopts:
   nrdf:     2636.83     23.9984     42933.2
   ref_t:         300         300         300
   tau_t:         0.1         0.1         0.1
anneal:          No          No          No
ann_npoints:           0           0           0
   acc:               0           0           0
   nfreeze:           N           N           N
   energygrp_flags[  0]: 0
   efield-x:
      n = 0
   efield-xt:
      n = 0
   efield-y:
      n = 0
   efield-yt:
      n = 0
   efield-z:
      n = 0
   efield-zt:
      n = 0
   bQMMM                = FALSE
   QMconstraints        = 0
   QMMMscheme           = 0
   scalefactor          = 1
qm_opts:
   ngQM                 = 0

Initializing Domain Decomposition on 64 nodes
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.571 nm, LJ-14, atoms 439 442
  multi-body bonded interactions: 0.571 nm, Proper Dih., atoms 439 442
Minimum cell size due to bonded interactions: 0.628 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.825 nm
Estimated maximum distance required for P-LINCS: 0.825 nm
This distance will limit the DD cell size, you can override this with -rcon
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 64 cells with a minimum initial size of 1.031 nm
The maximum allowed number of cells is: X 5 Y 5 Z 4
Domain decomposition grid 4 x 4 x 4, separate PME nodes 0
Domain decomposition nodeid 0, coordinates 0 0 0

Using two step summing over 11 groups of on average 5.8 processes

Table routines are used for coulomb: FALSE
Table routines are used for vdw:     FALSE
Cut-off's:   NS: 0.9   Coulomb: 1.4   LJ: 1.4
System total charge: 0.000
Generated table with 1200 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1200 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1200 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Enabling SPC water optimization for 7156 molecules.

Configuring nonbonded kernels...
Testing x86_64 SSE support... present.


Removing pbc first time

Initializing Parallel LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- --- Thank You --- -------- --------

The number of constraints is 1407
There are inter charge-group constraints,
will communicate selected coordinates each lincs iteration
117 constraints are involved in constraint triangles,
will apply an additional matrix expansion of order 4 for couplings
between constraints inside triangles

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------


Linking all bonded interactions to atoms
There are 379 inter charge-group virtual sites,
will an extra communication step for selected coordinates and forces

The initial number of communication pulses is: X 1 Y 1 Z 2
The initial domain decomposition cell size is: X 1.43 nm Y 1.43 nm Z 1.24 nm

The maximum allowed distance for charge groups involved in interactions is:
                 non-bonded interactions           1.400 nm
            two-body bonded interactions  (-rdd)   1.400 nm
          multi-body bonded interactions  (-rdd)   1.239 nm
              virtual site constructions  (-rcon)  1.239 nm
  atoms separated by up to 5 constraints  (-rcon)  1.239 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 2 Y 2 Z 2
The minimum size for domain decomposition cells is 0.905 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.63 Y 0.63 Z 0.73
The maximum allowed distance for charge groups involved in interactions is:
                 non-bonded interactions           1.400 nm
            two-body bonded interactions  (-rdd)   1.400 nm
          multi-body bonded interactions  (-rdd)   0.905 nm
              virtual site constructions  (-rcon)  0.905 nm
  atoms separated by up to 5 constraints  (-rcon)  0.905 nm
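
As a rough cross-check of the domain decomposition numbers reported above, the following Python sketch recomputes them from values copied out of the log. The box lengths are an assumption: they are reconstructed as grid count times the rounded initial cell size, so they are only approximate. Treating the maximum cell count as floor(box length / minimum initial cell size) is also an assumption; it reproduces the reported "X 5 Y 5 Z 4", but it is not taken from the GROMACS source.

```python
import math

# Values copied from the log above.
plincs_distance_nm = 0.825              # estimated maximum distance required for P-LINCS
dds = 0.8                               # requested allowed shrink of DD cells (-dds)
dd_grid = (4, 4, 4)                     # domain decomposition grid
initial_cell_nm = (1.43, 1.43, 1.24)    # initial DD cell size per dimension
dlb_min_cell_nm = 0.905                 # minimum DD cell size once DLB is on

# Minimum initial cell size: the P-LINCS distance scaled by 1/dds.
min_initial_cell_nm = plincs_distance_nm / dds

# Assumption: approximate box lengths rebuilt from the rounded cell sizes.
box_nm = tuple(n * c for n, c in zip(dd_grid, initial_cell_nm))

# Assumption: the cell limit per dimension is floor(box / minimum cell size).
max_cells = tuple(math.floor(b / min_initial_cell_nm) for b in box_nm)

# Allowed shrink per dimension: DLB minimum cell size over initial cell size.
allowed_shrink = tuple(round(dlb_min_cell_nm / c, 2) for c in initial_cell_nm)

print(f"minimum initial cell size: {min_initial_cell_nm:.3f} nm")
print(f"maximum number of cells:   {max_cells}")
print(f"allowed shrink:            {allowed_shrink}")
```

With a limit of 5 x 5 x 4 cells, the 4 x 4 x 4 grid sits close to the smallest cells this system allows, which fits the later message that load balancing is limited by the minimum cell size in Y and Z.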


Making 3D domain decomposition grid 4 x 4 x 4, home cell index 0 0 0

Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  rest

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, J. P. M. Postma, A. DiNola and J. R. Haak
Molecular dynamics with coupling to an external bath
J. Chem. Phys. 81 (1984) pp. 3684-3690
-------- -------- --- Thank You --- -------- --------

There are: 22824 Atoms
There are: 383 VSites
Charge group distribution at step 0: 119 121 124 123 127 118 128 126 117 112
124 118 126 120 130 120 121 136 124 123 118 117 125 130 122 129 127 123 125
125 113 119 124 127 124 124 123 119 128 129 123 128 126 121 119 124 118 129
131 118 119 119 122 128 129 124 121 123 125 120 120 120 116 131
Grid: 6 x 6 x 5 cells

Constraining the starting coordinates (step 0)

Constraining the coordinates at t0-dt (step 0)
RMS relative constraint deviation after constraining: 3.57e-05
Initial temperature: 311.264 K

Started mdrun on node 0 Sat Feb 13 17:23:39 2010

           Step           Time         Lambda
              0        0.00000        0.00000

   Energies (kJ/mol)
       G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
    2.20938e+03    1.06206e+03    5.21012e+02    5.34001e+02    1.67617e+04
        LJ (SR)        LJ (LR)   Coulomb (SR)   Coulomb (LR)      Potential
    4.37552e+04   -1.85437e+03   -3.77685e+05   -2.78734e+03   -3.17483e+05
    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
    5.90556e+04   -2.58428e+05    3.11564e+02    1.98804e+02    3.56693e-05

DD  step 4 load imb.: force 262.0%

At step 5 the performance loss due to force load imbalance is 19.1 %

NOTE: Turning on dynamic load balancing

DD  load balancing is limited by minimum cell size in dimension Y Z
DD  step 4999  vol min/aver 0.453! load imb.: force 42.2%

           Step           Time         Lambda
           5000       20.00000        0.00000

Writing checkpoint, step 5000 at Sat Feb 13 17:23:57 2010

   Energies (kJ/mol)
       G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
    2.18559e+03    1.08758e+03    5.08072e+02    5.73181e+02    1.67070e+04
        LJ (SR)        LJ (LR)   Coulomb (SR)   Coulomb (LR)      Potential
    4.39756e+04   -1.84574e+03   -3.78315e+05   -9.08535e+03   -3.24209e+05
    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
    5.81564e+04   -2.66053e+05    3.06820e+02   -3.44878e+02    9.68320e-05

    <======  ###############  ==>
    <====  A V E R A G E S  ====>
    <==  ###############  ======>

   Energies (kJ/mol)
       G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
    2.13937e+03    1.08823e+03    4.88467e+02    5.56312e+02    1.66991e+04
        LJ (SR)        LJ (LR)   Coulomb (SR)   Coulomb (LR)      Potential
    4.37569e+04   -1.85173e+03   -3.78660e+05   -7.85919e+03   -3.23642e+05
    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
    5.84560e+04   -2.65186e+05    3.08400e+02   -2.56636e+02    0.00000e+00

   Total Virial (kJ/mol)
    2.14238e+04    1.20840e+02    1.11414e+02
    1.21134e+02    2.14442e+04    1.16878e+01
    1.11918e+02    1.23263e+01    2.12292e+04

   Pressure (bar)
   -2.67468e+02   -2.17401e+01   -1.48656e+01
   -2.17802e+01   -2.66730e+02    1.77342e-01
   -1.49344e+01    9.02074e-02   -2.35709e+02

   Total Dipole (Debye)
   -3.97323e+02   -3.59815e+02   -1.52774e+02

      T-Protein          T-CL-          T-SOL
    2.99534e+02    3.00276e+02    3.08949e+02

    <======  ###############################  ==>
    <====  R M S - F L U C T U A T I O N S  ====>
    <==  ###############################  ======>

   Energies (kJ/mol)
       G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
    6.39796e+01    4.10873e+01    2.95910e+01    3.76420e+01    4.88986e+01
        LJ (SR)        LJ (LR)   Coulomb (SR)   Coulomb (LR)      Potential
    5.84609e+02    2.13849e+00    1.10640e+03    1.67778e+03    1.10444e+03
    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
    3.05395e+02    1.10173e+03    1.61119e+00    1.60301e+02    0.00000e+00

   Total Virial (kJ/mol)
    1.65615e+03    1.02322e+03    1.00778e+03
    1.02179e+03    1.66559e+03    1.04738e+03
    1.00766e+03    1.04676e+03    1.69082e+03

   Pressure (bar)
    2.28103e+02    1.41246e+02    1.39669e+02
    1.41035e+02    2.27116e+02    1.45691e+02
    1.39680e+02    1.45604e+02    2.31456e+02

   Total Dipole (Debye)
    3.19197e+02    1.87684e+02    1.24709e+02

      T-Protein          T-CL-          T-SOL
    5.84167e+00    7.10486e+01    1.65761e+00


    M E G A - F L O P S   A C C O U N T I N G

   RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
   T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
   NF=No Forces

 Computing:                         M-Number         M-Flops  % Flops
-----------------------------------------------------------------------
 LJ                               480.457532       15855.099     1.9
 Coulomb                          688.452307       18588.212     2.3
 Coulomb [W3]                      66.644451        5331.556     0.6
 Coulomb + LJ                     362.642477       13780.414     1.7
 Coulomb + LJ [W3]                156.776518       14266.663     1.7
 Coulomb + LJ [W3-W3]            2604.244668      638039.944    77.6
 Outer nonbonded loop             930.259077        9302.591     1.1
 1,4 nonbonded interactions        15.573114        1401.580     0.2
 NS-Pairs                        3507.987455       73667.737     9.0
 Reset In Box                       7.889882          23.670     0.0
 CG-CoM                            23.253414          69.760     0.0
 Angles                            11.487297        1929.866     0.2
 Propers                            4.330866         991.768     0.1
 Impropers                          2.730546         567.954     0.1
 Virial                           130.461087        2348.300     0.3
 Update                           116.058207        3597.804     0.4
 Stop-CM                          116.058207        1160.582     0.1
 Calc-Ekin                        116.081414        3134.198     0.4
 Lincs                             15.846123         950.767     0.1
 Lincs-Mat                        266.595492        1066.382     0.1
 Constraint-V                     139.069412        1112.555     0.1
 Constraint-Vir                   123.201821        2956.844     0.4
 Settle                            35.801468       11563.874     1.4
 Virtual Site 3                     0.140028           5.181     0.0
 Virtual Site 3fd                   1.205241         114.498     0.0
 Virtual Site 3fad                  0.430086          75.695     0.0
 Virtual Site 3out                  0.140028          12.182     0.0
-----------------------------------------------------------------------
 Total                                            821915.676   100.0
-----------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 146675.8
 av. #atoms communicated per step for vsites: 2 x 122.7
 av. #atoms communicated per step for LINCS:  2 x 1993.1

 Average load imbalance: 63.9 %
 Part of the total run time spent waiting due to load imbalance: 6.8 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 19 % Z 19 %

NOTE: 6.8 % performance was lost due to load imbalance
      in the domain decomposition.


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
 Domain decomp.        64       1001      542.428      228.8    19.9
 Vsite constr.         64       5001       13.336        5.6     0.5
 Comm. coord.          64       5001      201.623       85.0     7.4
 Neighbor search       64       1001      468.949      197.8    17.2
 Force                 64       5001      286.647      120.9    10.5
 Wait + Comm. F        64       5001      525.059      221.4    19.2
 Vsite spread          64       5001       57.706       24.3     2.1
 Write traj.           64          1        0.739        0.3     0.0
 Update                64       5001       17.965        7.6     0.7
 Constraints           64       5001      168.205       70.9     6.2
 Comm. energies        64       5001      432.254      182.3    15.8
 Rest                  64                  16.536        7.0     0.6
-----------------------------------------------------------------------
 Total                 64                2731.446     1152.0   100.0
-----------------------------------------------------------------------

NOTE: 16 % of the run time was spent communicating energies,
      you might want to use the -nosum option of mdrun


    Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:     18.000     18.000    100.0
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:   1424.445     45.662     96.019      0.250
Finished mdrun on node 0 Sat Feb 13 17:23:57 2010
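
To put the timing tables above in perspective, here is a small Python sketch that recomputes the throughput and the share of time spent in each accounting row, using only numbers copied from the log. The variable names are illustrative. The wall time is printed as 18.000 s in the log, so the recomputed ns/day agrees with the reported 96.019 only to within that rounding.

```python
# Values copied from the log above.
nsteps = 5000
delta_t_ps = 0.004
wall_seconds = 18.0
nodes = 64
atoms = 22824
vsites = 383

# "Seconds" column of the cycle accounting table (summed over all 64 nodes).
seconds = {
    "Domain decomp.": 228.8,
    "Vsite constr.": 5.6,
    "Comm. coord.": 85.0,
    "Neighbor search": 197.8,
    "Force": 120.9,
    "Wait + Comm. F": 221.4,
    "Vsite spread": 24.3,
    "Write traj.": 0.3,
    "Update": 7.6,
    "Constraints": 70.9,
    "Comm. energies": 182.3,
    "Rest": 7.0,
}
reported_total_seconds = 1152.0  # "Total" row, equal to 64 nodes x 18 s

# Throughput: simulated time divided by wall time, scaled to one day.
simulated_ns = nsteps * delta_t_ps / 1000.0
ns_per_day = simulated_ns / wall_seconds * 86400.0
hours_per_ns = 24.0 / ns_per_day

# Share of the total node time per row, largest first.
shares = {name: 100.0 * s / reported_total_seconds for name, s in seconds.items()}
force_share = shares["Force"]
particles_per_node = (atoms + vsites) / nodes

print(f"{ns_per_day:.1f} ns/day, {hours_per_ns:.3f} hour/ns")
print(f"about {particles_per_node:.0f} atoms and vsites per node")
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {share:5.1f} %")
```

Only about a tenth of the node time goes to the Force row, while domain decomposition, neighbor search, waiting on force communication and energy communication each take more, with roughly 360 atoms and virtual sites per node.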