[gmx-users] performance issue with the parallel implementation of gromacs
ashutosh srivastava
ashu4487 at gmail.com
Thu Sep 19 08:07:45 CEST 2013
Hi,
I have been trying to run simulations on a cluster consisting of 24 nodes
with Intel(R) Xeon(R) CPU X5670 @ 2.93GHz. Each node has 12 cores, and the
nodes are connected via 1 Gbit Ethernet and an InfiniBand interconnect. The
batch system is TORQUE. However, due to some issues with the parallel queue,
I have been running the simulations directly on the cluster using mpdboot
and mpirun.
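For reference, the launch sequence I use is along these lines (the hostfile
name here is just a placeholder for the one I actually pass to mpdboot):

mpdboot -n 24 -f mpd.hosts
mpirun -np 48 mdrun_mpi -s 4icl.tpr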
Following is the mdp.out file that I am using for the simulation:
; VARIOUS PREPROCESSING OPTIONS
; Preprocessor information: use cpp syntax.
; e.g.: -I/home/joe/doe -I/home/mary/roe
include =
; e.g.: -DPOSRES -DFLEXIBLE (note these variable names are case sensitive)
define = -DPOSRES
; RUN CONTROL PARAMETERS
integrator = md
; Start time and timestep in ps
tinit = 0
dt = 0.002
nsteps = 250000
; For exact run continuation or redoing part of a run
init-step = 0
; Part index is updated automatically on checkpointing (keeps files separate)
simulation-part = 1
; mode for center of mass motion removal
comm-mode = Linear
; number of steps for center of mass motion removal
nstcomm = 100
; group(s) for center of mass motion removal
comm-grps =
; LANGEVIN DYNAMICS OPTIONS
; Friction coefficient (amu/ps) and random seed
bd-fric = 0
ld-seed = 1993
; ENERGY MINIMIZATION OPTIONS
; Force tolerance and initial step-size
emtol = 10
emstep = 0.01
; Max number of iterations in relax-shells
niter = 20
; Step size (ps^2) for minimization of flexible constraints
fcstep = 0
; Frequency of steepest descents steps when doing CG
nstcgsteep = 1000
nbfgscorr = 10
; TEST PARTICLE INSERTION OPTIONS
rtpi = 0.05
; OUTPUT CONTROL OPTIONS
; Output frequency for coords (x), velocities (v) and forces (f)
nstxout = 100
nstvout = 100
nstfout = 0
; Output frequency for energies to log file and energy file
nstlog = 100
nstcalcenergy = 100
nstenergy = 100
; Output frequency and precision for .xtc file
nstxtcout = 0
xtc-precision = 1000
; This selects the subset of atoms for the .xtc file. You can
; select multiple groups. By default all atoms will be written.
xtc-grps =
; Selection of energy groups
energygrps =
; NEIGHBORSEARCHING PARAMETERS
; cut-off scheme (group: using charge groups, Verlet: particle based cut-offs)
cutoff-scheme = Group
; nblist update frequency
nstlist = 5
; ns algorithm (simple or grid)
ns_type = grid
; Periodic boundary conditions: xyz, no, xy
pbc = xyz
periodic-molecules = no
; Allowed energy drift due to the Verlet buffer in kJ/mol/ps per atom,
; a value of -1 means: use rlist
verlet-buffer-drift = 0.005
; nblist cut-off
rlist = 1.0
; long-range cut-off for switched potentials
rlistlong = -1
nstcalclr = -1
; OPTIONS FOR ELECTROSTATICS AND VDW
; Method for doing electrostatics
coulombtype = PME
coulomb-modifier = Potential-shift-Verlet
rcoulomb-switch = 0
rcoulomb = 1.0
; Relative dielectric constant for the medium and the reaction field
epsilon-r = 1
epsilon-rf = 0
; Method for doing Van der Waals
vdw-type = Cut-off
vdw-modifier = Potential-shift-Verlet
; cut-off lengths
rvdw-switch = 0
rvdw = 1.0
; Apply long range dispersion corrections for Energy and Pressure
DispCorr = EnerPres
; Extension of the potential lookup tables beyond the cut-off
table-extension = 1
; Separate tables between energy group pairs
energygrp-table =
; Spacing for the PME/PPPM FFT grid
fourierspacing = 0.16
; FFT grid size, when a value is 0 fourierspacing will be used
fourier-nx = 0
fourier-ny = 0
fourier-nz = 0
; EWALD/PME/PPPM parameters
pme_order = 4
ewald-rtol = 1e-05
ewald-geometry = 3d
epsilon-surface = 0
optimize-fft = no
; IMPLICIT SOLVENT ALGORITHM
implicit-solvent = No
; GENERALIZED BORN ELECTROSTATICS
; Algorithm for calculating Born radii
gb-algorithm = Still
; Frequency of calculating the Born radii inside rlist
nstgbradii = 1
; Cutoff for Born radii calculation; the contribution from atoms
; between rlist and rgbradii is updated every nstlist steps
rgbradii = 1
; Dielectric coefficient of the implicit solvent
gb-epsilon-solvent = 80
; Salt concentration in M for Generalized Born models
gb-saltconc = 0
; Scaling factors used in the OBC GB model. Default values are OBC(II)
gb-obc-alpha = 1
gb-obc-beta = 0.8
gb-obc-gamma = 4.85
gb-dielectric-offset = 0.009
sa-algorithm = Ace-approximation
; Surface tension (kJ/mol/nm^2) for the SA (nonpolar surface) part of GBSA
; The value -1 will set default value for Still/HCT/OBC GB-models.
sa-surface-tension = -1
; OPTIONS FOR WEAK COUPLING ALGORITHMS
; Temperature coupling
tcoupl = V-rescale
nsttcouple = -1
nh-chain-length = 10
print-nose-hoover-chain-variables = no
; Groups to couple separately
tc-grps = Protein Non-Protein
; Time constant (ps) and reference temperature (K)
tau_t = 0.1 0.1
ref_t = 300 300
; pressure coupling
pcoupl = no
pcoupltype = Isotropic
nstpcouple = -1
; Time constant (ps), compressibility (1/bar) and reference P (bar)
tau-p = 1
compressibility =
ref-p =
; Scaling of reference coordinates, No, All or COM
refcoord-scaling = No
; OPTIONS FOR QMMM calculations
QMMM = no
; Groups treated Quantum Mechanically
QMMM-grps =
; QM method
QMmethod =
; QMMM scheme
QMMMscheme = normal
; QM basisset
QMbasis =
; QM charge
QMcharge =
; QM multiplicity
QMmult =
; Surface Hopping
SH =
; CAS space options
CASorbitals =
CASelectrons =
SAon =
SAoff =
SAsteps =
; Scale factor for MM charges
MMChargeScaleFactor = 1
; Optimization of QM subsystem
bOPT =
bTS =
; SIMULATED ANNEALING
; Type of annealing for each temperature group (no/single/periodic)
annealing =
; Number of time points to use for specifying annealing in each group
annealing-npoints =
; List of times at the annealing points for each group
annealing-time =
; Temp. at each annealing point, for each group.
annealing-temp =
; GENERATE VELOCITIES FOR STARTUP RUN
gen_vel = yes
gen_temp = 300
gen_seed = -1
; OPTIONS FOR BONDS
constraints = all-bonds
; Type of constraint algorithm
constraint_algorithm = lincs
; Do not constrain the start configuration
continuation = no
; Use successive overrelaxation to reduce the number of shake iterations
Shake-SOR = no
; Relative tolerance of shake
shake-tol = 0.0001
; Highest order in the expansion of the constraint coupling matrix
lincs_order = 4
; Number of iterations in the final step of LINCS. 1 is fine for
; normal simulations, but use 2 to conserve energy in NVE runs.
; For energy minimization with constraints it should be 4 to 8.
lincs_iter = 1
; Lincs will write a warning to the stderr if in one step a bond
; rotates over more degrees than
lincs-warnangle = 30
; Convert harmonic bonds to morse potentials
morse = no
; ENERGY GROUP EXCLUSIONS
; Pairs of energy groups for which all non-bonded interactions are excluded
energygrp-excl =
; WALLS
; Number of walls, type, atom types, densities and box-z scale factor for Ewald
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype =
wall-density =
wall-ewald-zfac = 3
; COM PULLING
; Pull type: no, umbrella, constraint or constant-force
pull = no
; ENFORCED ROTATION
; Enforced rotation: No or Yes
rotation = no
; NMR refinement stuff
; Distance restraints type: No, Simple or Ensemble
disre = No
; Force weighting of pairs in one distance restraint: Conservative or Equal
disre-weighting = Conservative
; Use sqrt of the time averaged times the instantaneous violation
disre-mixed = no
disre-fc = 1000
disre-tau = 0
; Output frequency for pair distances to energy file
nstdisreout = 100
; Orientation restraints: No or Yes
orire = no
; Orientation restraints force constant and tau for time averaging
orire-fc = 0
orire-tau = 0
orire-fitgrp =
; Output frequency for trace(SD) and S to energy file
nstorireout = 100
; Free energy variables
free-energy = no
couple-moltype =
couple-lambda0 = vdw-q
couple-lambda1 = vdw-q
couple-intramol = no
init-lambda = -1
init-lambda-state = -1
delta-lambda = 0
nstdhdl = 50
fep-lambdas =
mass-lambdas =
coul-lambdas =
vdw-lambdas =
bonded-lambdas =
restraint-lambdas =
temperature-lambdas =
calc-lambda-neighbors = 1
init-lambda-weights =
dhdl-print-energy = no
sc-alpha = 0
sc-power = 1
sc-r-power = 6
sc-sigma = 0.3
sc-coul = no
separate-dhdl-file = yes
dhdl-derivatives = yes
dh_hist_size = 0
dh_hist_spacing = 0.1
; Non-equilibrium MD stuff
acc-grps =
accelerate =
freezegrps =
freezedim =
cos-acceleration = 0
deform =
; simulated tempering variables
simulated-tempering = no
simulated-tempering-scaling = geometric
sim-temp-low = 300
sim-temp-high = 300
; Electric fields
; Format is number of terms (int) and for all terms an amplitude (real)
; and a phase angle (real)
E-x =
E-xt =
E-y =
E-yt =
E-z =
E-zt =
; AdResS parameters
adress = no
; User defined thingies
user1-grps =
user2-grps =
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
The system has 250,853 atoms. I used g_tune_pme to check the performance
with different numbers of processors.
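The invocation was roughly the following (I point g_tune_pme at the MPI
launcher and the parallel mdrun binary through environment variables; I may
be misremembering minor flags, so treat this as a sketch of what I ran):

export MPIRUN=`which mpirun`
export MDRUN=/data1/shashi/localbin/gromacs/bin/mdrun_mpi
g_tune_pme -np 48 -s 4icl.tpr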
Following are the perf.out files for 48 and 160 processors, respectively:
Summary of successful runs:
 Line tpr PME nodes  Gcycles Av.  Std.dev.  ns/day   PME/f  DD grid
    0   0         8      181.713     7.698   0.952   1.334   8  5  1
    1   0         6      156.720     4.086   1.104   1.420   6  7  1
    2   0         4      196.320    16.161   0.885   0.916   4 11  1
    3   0         3      195.312     1.127   0.886   0.840   3  5  3
    4   0         0      370.539    12.942   0.468       -   8  6  1
    5   0    -1( 8)      185.688     0.839   0.932   1.322   8  5  1
    6   1         8      185.651    14.798   0.934   1.294   8  5  1
    7   1         6      155.970     3.320   1.110   1.157   6  7  1
    8   1         4      177.021    15.459   0.980   1.005   4 11  1
    9   1         3      190.704    22.673   0.914   0.931   3  5  3
   10   1         0      293.676     5.460   0.589       -   8  6  1
   11   1    -1( 8)      188.978     3.686   0.915   1.266   8  5  1
   12   2         8      210.631    17.457   0.824   1.176   8  5  1
   13   2         6      171.926    10.462   1.008   1.186   6  7  1
   14   2         4      200.015     6.696   0.865   0.839   4 11  1
   15   2         3      215.013     5.881   0.804   0.863   3  5  3
   16   2         0      298.363     7.187   0.580       -   8  6  1
   17   2    -1( 8)      208.821    34.409   0.840   1.088   8  5  1
------------------------------------------------------------
Best performance was achieved with 6 PME nodes (see line 7)
Optimized PME settings:
New Coulomb radius: 1.100000 nm (was 1.000000 nm)
New Van der Waals radius: 1.100000 nm (was 1.000000 nm)
New Fourier grid xyz: 80 80 80 (was 96 96 96)
Please use this command line to launch the simulation:
mpirun -np 48 mdrun_mpi -npme 6 -s tuned.tpr -pin on
Summary of successful runs:
 Line tpr PME nodes  Gcycles Av.  Std.dev.  ns/day   PME/f  DD grid
    0   0        25      283.628     2.191   0.610   1.749   5  9  3
    1   0        20      240.888     9.132   0.719   1.618   5  4  7
    2   0        16      166.570     0.394   1.038   1.239   8  6  3
    3   0         0      435.389     3.399   0.397       -  10  8  2
    4   0   -1( 20)      237.623     6.298   0.729   1.406   5  4  7
    5   1        25      286.990     1.662   0.603   1.813   5  9  3
    6   1        20      235.818     0.754   0.734   1.495   5  4  7
    7   1        16      167.888     3.028   1.030   1.256   8  6  3
    8   1         0      284.264     3.775   0.609       -   8  5  4
    9   1   -1( 16)      167.858     1.924   1.030   1.303   8  6  3
   10   2        25      298.637     1.660   0.579   1.696   5  9  3
   11   2        20      281.647     1.074   0.614   1.296   5  4  7
   12   2        16      184.012     4.022   0.941   1.244   8  6  3
   13   2         0      304.658     0.793   0.568       -   8  5  4
   14   2   -1( 16)      183.084     2.203   0.945   1.188   8  6  3
------------------------------------------------------------
Best performance was achieved with 16 PME nodes (see line 2)
and original PME settings.
Please use this command line to launch the simulation:
mpirun -np 160 /data1/shashi/localbin/gromacs/bin/mdrun_mpi -npme 16 -s 4icl.tpr -pin on
Both of these outcomes (1.110 ns/day and 1.038 ns/day) are lower than what I
get on my workstation with a Xeon W3550 @ 3.07 GHz using 8 threads
(1.431 ns/day) for a similar system.
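A back-of-the-envelope per-core comparison makes the gap clearer:

workstation:  1.431 ns/day /   8 threads ~ 0.18  ns/day per thread
cluster:      1.110 ns/day /  48 cores   ~ 0.023 ns/day per core
cluster:      1.038 ns/day / 160 cores   ~ 0.006 ns/day per core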
The bench.log file generated by g_tune_pme shows very high load imbalance
(above 60%, up to 100%). I have tried several combinations of -np and -npme,
but the performance always stays in this range; see the examples below.
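For example, I have relaunched with the command lines suggested by
g_tune_pme, varying only -np and -npme; these two are representative of what
I tried:

mpirun -np 48 mdrun_mpi -npme 6 -s tuned.tpr -pin on
mpirun -np 160 /data1/shashi/localbin/gromacs/bin/mdrun_mpi -npme 16 -s 4icl.tpr -pin on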
Can someone please tell me what I am doing wrong, or how I can decrease the
simulation time?
--
Regards
Ashutosh Srivastava