[gmx-users] performance issue with the parallel implementation of gromacs
ashutosh srivastava
ashu4487 at gmail.com
Thu Sep 19 08:07:45 CEST 2013
Hi,
I have been trying to run simulations on a cluster consisting of 24 nodes
with Intel(R) Xeon(R) CPU X5670 @ 2.93GHz. Each node has 12 cores, and the
nodes are connected via 1 Gbit Ethernet and an InfiniBand interconnect. The
batch system is TORQUE. However, due to some issues with the parallel queue,
I have been running the simulations directly on the cluster using mpdboot
and mpirun.
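For reference, the launch sequence I use is along these lines (the hostfile
name here is just a placeholder for the one I actually pass to mpdboot):

mpdboot -n 24 -f mpd.hosts
mpirun -np 48 mdrun_mpi -s 4icl.tpr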
Following is the mdp.out file that I am using for the simulation:
; VARIOUS PREPROCESSING OPTIONS
; Preprocessor information: use cpp syntax.
; e.g.: -I/home/joe/doe -I/home/mary/roe
include =
; e.g.: -DPOSRES -DFLEXIBLE (note these variable names are case sensitive)
define = -DPOSRES
; RUN CONTROL PARAMETERS
integrator = md
; Start time and timestep in ps
tinit = 0
dt = 0.002
nsteps = 250000
; For exact run continuation or redoing part of a run
init-step = 0
; Part index is updated automatically on checkpointing (keeps files separate)
simulation-part = 1
; mode for center of mass motion removal
comm-mode = Linear
; number of steps for center of mass motion removal
nstcomm = 100
; group(s) for center of mass motion removal
comm-grps =
; LANGEVIN DYNAMICS OPTIONS
; Friction coefficient (amu/ps) and random seed
bd-fric = 0
ld-seed = 1993
; ENERGY MINIMIZATION OPTIONS
; Force tolerance and initial step-size
emtol = 10
emstep = 0.01
; Max number of iterations in relax-shells
niter = 20
; Step size (ps^2) for minimization of flexible constraints
fcstep = 0
; Frequency of steepest descents steps when doing CG
nstcgsteep = 1000
nbfgscorr = 10
; TEST PARTICLE INSERTION OPTIONS
rtpi = 0.05
; OUTPUT CONTROL OPTIONS
; Output frequency for coords (x), velocities (v) and forces (f)
nstxout = 100
nstvout = 100
nstfout = 0
; Output frequency for energies to log file and energy file
nstlog = 100
nstcalcenergy = 100
nstenergy = 100
; Output frequency and precision for .xtc file
nstxtcout = 0
xtc-precision = 1000
; This selects the subset of atoms for the .xtc file. You can
; select multiple groups. By default all atoms will be written.
xtc-grps =
; Selection of energy groups
energygrps =
; NEIGHBORSEARCHING PARAMETERS
; cut-off scheme (group: using charge groups, Verlet: particle based cut-offs)
cutoff-scheme = Group
; nblist update frequency
nstlist = 5
; ns algorithm (simple or grid)
ns_type = grid
; Periodic boundary conditions: xyz, no, xy
pbc = xyz
periodic-molecules = no
; Allowed energy drift due to the Verlet buffer in kJ/mol/ps per atom,
; a value of -1 means: use rlist
verlet-buffer-drift = 0.005
; nblist cut-off
rlist = 1.0
; long-range cut-off for switched potentials
rlistlong = -1
nstcalclr = -1
; OPTIONS FOR ELECTROSTATICS AND VDW
; Method for doing electrostatics
coulombtype = PME
coulomb-modifier = Potential-shift-Verlet
rcoulomb-switch = 0
rcoulomb = 1.0
; Relative dielectric constant for the medium and the reaction field
epsilon-r = 1
epsilon-rf = 0
; Method for doing Van der Waals
vdw-type = Cut-off
vdw-modifier = Potential-shift-Verlet
; cut-off lengths
rvdw-switch = 0
rvdw = 1.0
; Apply long range dispersion corrections for Energy and Pressure
DispCorr = EnerPres
; Extension of the potential lookup tables beyond the cut-off
table-extension = 1
; Separate tables between energy group pairs
energygrp-table =
; Spacing for the PME/PPPM FFT grid
fourierspacing = 0.16
; FFT grid size, when a value is 0 fourierspacing will be used
fourier-nx = 0
fourier-ny = 0
fourier-nz = 0
; EWALD/PME/PPPM parameters
pme_order = 4
ewald-rtol = 1e-05
ewald-geometry = 3d
epsilon-surface = 0
optimize-fft = no
; IMPLICIT SOLVENT ALGORITHM
implicit-solvent = No
; GENERALIZED BORN ELECTROSTATICS
; Algorithm for calculating Born radii
gb-algorithm = Still
; Frequency of calculating the Born radii inside rlist
nstgbradii = 1
; Cutoff for Born radii calculation; the contribution from atoms
; between rlist and rgbradii is updated every nstlist steps
rgbradii = 1
; Dielectric coefficient of the implicit solvent
gb-epsilon-solvent = 80
; Salt concentration in M for Generalized Born models
gb-saltconc = 0
; Scaling factors used in the OBC GB model. Default values are OBC(II)
gb-obc-alpha = 1
gb-obc-beta = 0.8
gb-obc-gamma = 4.85
gb-dielectric-offset = 0.009
sa-algorithm = Ace-approximation
; Surface tension (kJ/mol/nm^2) for the SA (nonpolar surface) part of GBSA
; The value -1 will set default value for Still/HCT/OBC GB-models.
sa-surface-tension = -1
; OPTIONS FOR WEAK COUPLING ALGORITHMS
; Temperature coupling
tcoupl = V-rescale
nsttcouple = -1
nh-chain-length = 10
print-nose-hoover-chain-variables = no
; Groups to couple separately
tc-grps = Protein Non-Protein
; Time constant (ps) and reference temperature (K)
tau_t = 0.1 0.1
ref_t = 300 300
; pressure coupling
pcoupl = no
pcoupltype = Isotropic
nstpcouple = -1
; Time constant (ps), compressibility (1/bar) and reference P (bar)
tau-p = 1
compressibility =
ref-p =
; Scaling of reference coordinates, No, All or COM
refcoord-scaling = No
; OPTIONS FOR QMMM calculations
QMMM = no
; Groups treated Quantum Mechanically
QMMM-grps =
; QM method
QMmethod =
; QMMM scheme
QMMMscheme = normal
; QM basisset
QMbasis =
; QM charge
QMcharge =
; QM multiplicity
QMmult =
; Surface Hopping
SH =
; CAS space options
CASorbitals =
CASelectrons =
SAon =
SAoff =
SAsteps =
; Scale factor for MM charges
MMChargeScaleFactor = 1
; Optimization of QM subsystem
bOPT =
bTS =
; SIMULATED ANNEALING
; Type of annealing for each temperature group (no/single/periodic)
annealing =
; Number of time points to use for specifying annealing in each group
annealing-npoints =
; List of times at the annealing points for each group
annealing-time =
; Temp. at each annealing point, for each group.
annealing-temp =
; GENERATE VELOCITIES FOR STARTUP RUN
gen_vel = yes
gen_temp = 300
gen_seed = -1
; OPTIONS FOR BONDS
constraints = all-bonds
; Type of constraint algorithm
constraint_algorithm = lincs
; Do not constrain the start configuration
continuation = no
; Use successive overrelaxation to reduce the number of shake iterations
Shake-SOR = no
; Relative tolerance of shake
shake-tol = 0.0001
; Highest order in the expansion of the constraint coupling matrix
lincs_order = 4
; Number of iterations in the final step of LINCS. 1 is fine for
; normal simulations, but use 2 to conserve energy in NVE runs.
; For energy minimization with constraints it should be 4 to 8.
lincs_iter = 1
; Lincs will write a warning to the stderr if in one step a bond
; rotates over more degrees than
lincs-warnangle = 30
; Convert harmonic bonds to morse potentials
morse = no
; ENERGY GROUP EXCLUSIONS
; Pairs of energy groups for which all non-bonded interactions are excluded
energygrp-excl =
; WALLS
; Number of walls, type, atom types, densities and box-z scale factor for Ewald
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype =
wall-density =
wall-ewald-zfac = 3
; COM PULLING
; Pull type: no, umbrella, constraint or constant-force
pull = no
; ENFORCED ROTATION
; Enforced rotation: No or Yes
rotation = no
; NMR refinement stuff
; Distance restraints type: No, Simple or Ensemble
disre = No
; Force weighting of pairs in one distance restraint: Conservative or Equal
disre-weighting = Conservative
; Use sqrt of the time averaged times the instantaneous violation
disre-mixed = no
disre-fc = 1000
disre-tau = 0
; Output frequency for pair distances to energy file
nstdisreout = 100
; Orientation restraints: No or Yes
orire = no
; Orientation restraints force constant and tau for time averaging
orire-fc = 0
orire-tau = 0
orire-fitgrp =
; Output frequency for trace(SD) and S to energy file
nstorireout = 100
; Free energy variables
free-energy = no
couple-moltype =
couple-lambda0 = vdw-q
couple-lambda1 = vdw-q
couple-intramol = no
init-lambda = -1
init-lambda-state = -1
delta-lambda = 0
nstdhdl = 50
fep-lambdas =
mass-lambdas =
coul-lambdas =
vdw-lambdas =
bonded-lambdas =
restraint-lambdas =
temperature-lambdas =
calc-lambda-neighbors = 1
init-lambda-weights =
dhdl-print-energy = no
sc-alpha = 0
sc-power = 1
sc-r-power = 6
sc-sigma = 0.3
sc-coul = no
separate-dhdl-file = yes
dhdl-derivatives = yes
dh_hist_size = 0
dh_hist_spacing = 0.1
; Non-equilibrium MD stuff
acc-grps =
accelerate =
freezegrps =
freezedim =
cos-acceleration = 0
deform =
; simulated tempering variables
simulated-tempering = no
simulated-tempering-scaling = geometric
sim-temp-low = 300
sim-temp-high = 300
; Electric fields
; Format is number of terms (int) and for all terms an amplitude (real)
; and a phase angle (real)
E-x =
E-xt =
E-y =
E-yt =
E-z =
E-zt =
; AdResS parameters
adress = no
; User defined thingies
user1-grps =
user2-grps =
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
The system has 250,853 atoms. I used g_tune_pme to check the performance
with different numbers of processors.
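The invocation was roughly the following (I point g_tune_pme at the MPI
launcher and the parallel mdrun binary through environment variables; I may
be misremembering minor flags, so treat this as a sketch of what I ran):

export MPIRUN=`which mpirun`
export MDRUN=/data1/shashi/localbin/gromacs/bin/mdrun_mpi
g_tune_pme -np 48 -s 4icl.tpr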
Following are the perf.out files for 48 and 160 processors, respectively:
Summary of successful runs:
 Line tpr PME nodes  Gcycles Av.  Std.dev.  ns/day   PME/f  DD grid
    0   0         8      181.713     7.698   0.952   1.334   8  5  1
    1   0         6      156.720     4.086   1.104   1.420   6  7  1
    2   0         4      196.320    16.161   0.885   0.916   4 11  1
    3   0         3      195.312     1.127   0.886   0.840   3  5  3
    4   0         0      370.539    12.942   0.468       -   8  6  1
    5   0    -1( 8)      185.688     0.839   0.932   1.322   8  5  1
    6   1         8      185.651    14.798   0.934   1.294   8  5  1
    7   1         6      155.970     3.320   1.110   1.157   6  7  1
    8   1         4      177.021    15.459   0.980   1.005   4 11  1
    9   1         3      190.704    22.673   0.914   0.931   3  5  3
   10   1         0      293.676     5.460   0.589       -   8  6  1
   11   1    -1( 8)      188.978     3.686   0.915   1.266   8  5  1
   12   2         8      210.631    17.457   0.824   1.176   8  5  1
   13   2         6      171.926    10.462   1.008   1.186   6  7  1
   14   2         4      200.015     6.696   0.865   0.839   4 11  1
   15   2         3      215.013     5.881   0.804   0.863   3  5  3
   16   2         0      298.363     7.187   0.580       -   8  6  1
   17   2    -1( 8)      208.821    34.409   0.840   1.088   8  5  1
------------------------------------------------------------
Best performance was achieved with 6 PME nodes (see line 7)
Optimized PME settings:
New Coulomb radius: 1.100000 nm (was 1.000000 nm)
New Van der Waals radius: 1.100000 nm (was 1.000000 nm)
New Fourier grid xyz: 80 80 80 (was 96 96 96)
Please use this command line to launch the simulation:
mpirun -np 48 mdrun_mpi -npme 6 -s tuned.tpr -pin on
Summary of successful runs:
 Line tpr PME nodes  Gcycles Av.  Std.dev.  ns/day   PME/f  DD grid
    0   0        25      283.628     2.191   0.610   1.749   5  9  3
    1   0        20      240.888     9.132   0.719   1.618   5  4  7
    2   0        16      166.570     0.394   1.038   1.239   8  6  3
    3   0         0      435.389     3.399   0.397       -  10  8  2
    4   0   -1( 20)      237.623     6.298   0.729   1.406   5  4  7
    5   1        25      286.990     1.662   0.603   1.813   5  9  3
    6   1        20      235.818     0.754   0.734   1.495   5  4  7
    7   1        16      167.888     3.028   1.030   1.256   8  6  3
    8   1         0      284.264     3.775   0.609       -   8  5  4
    9   1   -1( 16)      167.858     1.924   1.030   1.303   8  6  3
   10   2        25      298.637     1.660   0.579   1.696   5  9  3
   11   2        20      281.647     1.074   0.614   1.296   5  4  7
   12   2        16      184.012     4.022   0.941   1.244   8  6  3
   13   2         0      304.658     0.793   0.568       -   8  5  4
   14   2   -1( 16)      183.084     2.203   0.945   1.188   8  6  3
------------------------------------------------------------
Best performance was achieved with 16 PME nodes (see line 2)
and original PME settings.
Please use this command line to launch the simulation:
mpirun -np 160 /data1/shashi/localbin/gromacs/bin/mdrun_mpi -npme 16 -s 4icl.tpr -pin on
Both of these outcomes (1.110 ns/day and 1.038 ns/day) are lower than what I
get on my workstation with a Xeon W3550 @ 3.07 GHz using 8 threads
(1.431 ns/day) for a similar system.
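A back-of-the-envelope per-core comparison makes the gap clearer:

workstation:  1.431 ns/day /   8 threads ~ 0.18  ns/day per thread
cluster:      1.110 ns/day /  48 cores   ~ 0.023 ns/day per core
cluster:      1.038 ns/day / 160 cores   ~ 0.006 ns/day per core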
The bench.log file generated by g_tune_pme shows very high load imbalance
(above 60%, up to 100%). I have tried several combinations of -np and -npme,
but the performance always stays in this range; see the examples below.
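For example, I have relaunched with the command lines suggested by
g_tune_pme, varying only -np and -npme; these two are representative of what
I tried:

mpirun -np 48 mdrun_mpi -npme 6 -s tuned.tpr -pin on
mpirun -np 160 /data1/shashi/localbin/gromacs/bin/mdrun_mpi -npme 16 -s 4icl.tpr -pin on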
Can someone please tell me what I am doing wrong, or how I can decrease the
simulation time?
--
Regards
Ashutosh Srivastava