[gmx-users] performance issue with the parallel implementation of gromacs

Carsten Kutzner ckutzne at gwdg.de
Thu Sep 19 10:22:05 CEST 2013


Hi,

do a scaling test and run on a single node only at first, so that you
can estimate the maximum performance you can expect when going to more
nodes.
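
A simple way to do the scaling test (just a sketch; topol.tpr and the
rank counts are placeholders for your own setup) is to run the same .tpr
for a short time on 1, 2, 4, ... nodes of 12 cores each and compare the
ns/day reported at the end of md.log:

  mpirun -np 12 mdrun_mpi -s topol.tpr -deffnm scaling_1node  -pin on
  mpirun -np 24 mdrun_mpi -s topol.tpr -deffnm scaling_2nodes -pin on
  mpirun -np 48 mdrun_mpi -s topol.tpr -deffnm scaling_4nodes -pin on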

On a single node you can also run with Gromacs' built-in thread-MPI,
which rules out the possibility that something is wrong with your MPI
library.
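
A single-node thread-MPI run could look like this (assuming your mdrun
binary was built with the default thread-MPI support; no mpirun is
needed, mdrun starts the 12 ranks itself):

  mdrun -ntmpi 12 -s topol.tpr -deffnm single_node -pin on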

There are many possible reasons for bad parallel performance. Can you
check that the InfiniBand interconnect is actually used, and not the
Ethernet? It could also be that a stray process is still running on one
of your cores and eating up CPU time, or that the pinning of threads to
cores is not correct (what does md.log say about that?).
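
For example (the exact commands depend on your MPI library and setup, so
take these only as a sketch): if you use Open MPI, you can force the
InfiniBand transport with something like

  mpirun --mca btl self,sm,openib -np 48 mdrun_mpi -s topol.tpr -pin on

(MVAPICH or Intel MPI need different options for this). Stray processes
you can spot with 'top' on the compute nodes, and you can grep md.log
for what mdrun reports about thread pinning, e.g.

  grep -i pinning md.log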

Just a few ideas.

Good luck!

Carsten


On Sep 19, 2013, at 8:07 AM, ashutosh srivastava <ashu4487 at gmail.com> wrote:

> Hi
> 
> I have been trying to run simulations on a cluster consisting of 24 nodes
> with Intel(R) Xeon(R) X5670 CPUs @ 2.93 GHz. Each node has 12 cores, and the
> nodes are connected via both 1 Gbit Ethernet and an InfiniBand interconnect.
> The batch system is TORQUE. However, due to some issues with the parallel
> queue, I have been running the simulations directly on the cluster using
> mpdboot and mpirun.
> Following is the mdp.out file that I am using for the simulation:
> ; VARIOUS PREPROCESSING OPTIONS
> ; Preprocessor information: use cpp syntax.
> ; e.g.: -I/home/joe/doe -I/home/mary/roe
> include                  =
> ; e.g.: -DPOSRES -DFLEXIBLE (note these variable names are case sensitive)
> define                   = -DPOSRES
> 
> ; RUN CONTROL PARAMETERS
> integrator               = md
> ; Start time and timestep in ps
> tinit                    = 0
> dt                       = 0.002
> nsteps                   = 250000
> ; For exact run continuation or redoing part of a run
> init-step                = 0
> ; Part index is updated automatically on checkpointing (keeps files separate)
> simulation-part          = 1
> ; mode for center of mass motion removal
> comm-mode                = Linear
> ; number of steps for center of mass motion removal
> nstcomm                  = 100
> ; group(s) for center of mass motion removal
> comm-grps                =
> 
> ; LANGEVIN DYNAMICS OPTIONS
> ; Friction coefficient (amu/ps) and random seed
> bd-fric                  = 0
> ld-seed                  = 1993
> 
> ; ENERGY MINIMIZATION OPTIONS
> ; Force tolerance and initial step-size
> emtol                    = 10
> emstep                   = 0.01
> ; Max number of iterations in relax-shells
> niter                    = 20
> ; Step size (ps^2) for minimization of flexible constraints
> fcstep                   = 0
> ; Frequency of steepest descents steps when doing CG
> nstcgsteep               = 1000
> nbfgscorr                = 10
> 
> ; TEST PARTICLE INSERTION OPTIONS
> rtpi                     = 0.05
> 
> ; OUTPUT CONTROL OPTIONS
> ; Output frequency for coords (x), velocities (v) and forces (f)
> nstxout                  = 100
> nstvout                  = 100
> nstfout                  = 0
> ; Output frequency for energies to log file and energy file
> nstlog                   = 100
> nstcalcenergy            = 100
> nstenergy                = 100
> ; Output frequency and precision for .xtc file
> nstxtcout                = 0
> xtc-precision            = 1000
> ; This selects the subset of atoms for the .xtc file. You can
> ; select multiple groups. By default all atoms will be written.
> xtc-grps                 =
> ; Selection of energy groups
> energygrps               =
> 
> ; NEIGHBORSEARCHING PARAMETERS
> ; cut-off scheme (group: using charge groups, Verlet: particle based cut-offs)
> cutoff-scheme            = Group
> ; nblist update frequency
> nstlist                  = 5
> ; ns algorithm (simple or grid)
> ns_type                  = grid
> ; Periodic boundary conditions: xyz, no, xy
> pbc                      = xyz
> periodic-molecules       = no
> ; Allowed energy drift due to the Verlet buffer in kJ/mol/ps per atom,
> ; a value of -1 means: use rlist
> verlet-buffer-drift      = 0.005
> ; nblist cut-off
> rlist                    = 1.0
> ; long-range cut-off for switched potentials
> rlistlong                = -1
> nstcalclr                = -1
> 
> ; OPTIONS FOR ELECTROSTATICS AND VDW
> ; Method for doing electrostatics
> coulombtype              = PME
> coulomb-modifier         = Potential-shift-Verlet
> rcoulomb-switch          = 0
> rcoulomb                 = 1.0
> ; Relative dielectric constant for the medium and the reaction field
> epsilon-r                = 1
> epsilon-rf               = 0
> ; Method for doing Van der Waals
> vdw-type                 = Cut-off
> vdw-modifier             = Potential-shift-Verlet
> ; cut-off lengths
> rvdw-switch              = 0
> rvdw                     = 1.0
> ; Apply long range dispersion corrections for Energy and Pressure
> DispCorr                 = EnerPres
> ; Extension of the potential lookup tables beyond the cut-off
> table-extension          = 1
> ; Separate tables between energy group pairs
> energygrp-table          =
> ; Spacing for the PME/PPPM FFT grid
> fourierspacing           = 0.16
> ; FFT grid size, when a value is 0 fourierspacing will be used
> fourier-nx               = 0
> fourier-ny               = 0
> fourier-nz               = 0
> ; EWALD/PME/PPPM parameters
> pme_order                = 4
> ewald-rtol               = 1e-05
> ewald-geometry           = 3d
> epsilon-surface          = 0
> optimize-fft             = no
> 
> ; IMPLICIT SOLVENT ALGORITHM
> implicit-solvent         = No
> 
> ; GENERALIZED BORN ELECTROSTATICS
> ; Algorithm for calculating Born radii
> gb-algorithm             = Still
> ; Frequency of calculating the Born radii inside rlist
> nstgbradii               = 1
> ; Cutoff for Born radii calculation; the contribution from atoms
> ; between rlist and rgbradii is updated every nstlist steps
> rgbradii                 = 1
> ; Dielectric coefficient of the implicit solvent
> gb-epsilon-solvent       = 80
> ; Salt concentration in M for Generalized Born models
> gb-saltconc              = 0
> ; Scaling factors used in the OBC GB model. Default values are OBC(II)
> gb-obc-alpha             = 1
> gb-obc-beta              = 0.8
> gb-obc-gamma             = 4.85
> gb-dielectric-offset     = 0.009
> sa-algorithm             = Ace-approximation
> ; Surface tension (kJ/mol/nm^2) for the SA (nonpolar surface) part of GBSA
> ; The value -1 will set default value for Still/HCT/OBC GB-models.
> sa-surface-tension       = -1
> 
> ; OPTIONS FOR WEAK COUPLING ALGORITHMS
> ; Temperature coupling
> tcoupl                   = V-rescale
> nsttcouple               = -1
> nh-chain-length          = 10
> print-nose-hoover-chain-variables = no
> ; Groups to couple separately
> tc-grps                  = Protein Non-Protein
> ; Time constant (ps) and reference temperature (K)
> tau_t                    = 0.1    0.1
> ref_t                    = 300     300
> ; pressure coupling
> pcoupl                   = no
> pcoupltype               = Isotropic
> nstpcouple               = -1
> ; Time constant (ps), compressibility (1/bar) and reference P (bar)
> tau-p                    = 1
> compressibility          =
> ref-p                    =
> ; Scaling of reference coordinates, No, All or COM
> refcoord-scaling         = No
> 
> ; OPTIONS FOR QMMM calculations
> QMMM                     = no
> ; Groups treated Quantum Mechanically
> QMMM-grps                =
> ; QM method
> QMmethod                 =
> ; QMMM scheme
> QMMMscheme               = normal
> ; QM basisset
> QMbasis                  =
> ; QM charge
> QMcharge                 =
> ; QM multiplicity
> QMmult                   =
> ; Surface Hopping
> SH                       =
> ; CAS space options
> CASorbitals              =
> CASelectrons             =
> SAon                     =
> SAoff                    =
> SAsteps                  =
> ; Scale factor for MM charges
> MMChargeScaleFactor      = 1
> ; Optimization of QM subsystem
> bOPT                     =
> bTS                      =
> 
> ; SIMULATED ANNEALING
> ; Type of annealing for each temperature group (no/single/periodic)
> annealing                =
> ; Number of time points to use for specifying annealing in each group
> annealing-npoints        =
> ; List of times at the annealing points for each group
> annealing-time           =
> ; Temp. at each annealing point, for each group.
> annealing-temp           =
> 
> ; GENERATE VELOCITIES FOR STARTUP RUN
> gen_vel                  = yes
> gen_temp                 = 300
> gen_seed                 = -1
> 
> ; OPTIONS FOR BONDS
> constraints              = all-bonds
> ; Type of constraint algorithm
> constraint_algorithm     = lincs
> ; Do not constrain the start configuration
> continuation             = no
> ; Use successive overrelaxation to reduce the number of shake iterations
> Shake-SOR                = no
> ; Relative tolerance of shake
> shake-tol                = 0.0001
> ; Highest order in the expansion of the constraint coupling matrix
> lincs_order              = 4
> ; Number of iterations in the final step of LINCS. 1 is fine for
> ; normal simulations, but use 2 to conserve energy in NVE runs.
> ; For energy minimization with constraints it should be 4 to 8.
> lincs_iter               = 1
> ; Lincs will write a warning to the stderr if in one step a bond
> ; rotates over more degrees than
> lincs-warnangle          = 30
> ; Convert harmonic bonds to morse potentials
> morse                    = no
> 
> ; ENERGY GROUP EXCLUSIONS
> ; Pairs of energy groups for which all non-bonded interactions are excluded
> energygrp-excl           =
> 
> ; WALLS
> ; Number of walls, type, atom types, densities and box-z scale factor for Ewald
> nwall                    = 0
> wall-type                = 9-3
> wall-r-linpot            = -1
> wall-atomtype            =
> wall-density             =
> wall-ewald-zfac          = 3
> 
> ; COM PULLING
> ; Pull type: no, umbrella, constraint or constant-force
> pull                     = no
> 
> ; ENFORCED ROTATION
> ; Enforced rotation: No or Yes
> rotation                 = no
> 
> ; NMR refinement stuff
> ; Distance restraints type: No, Simple or Ensemble
> disre                    = No
> ; Force weighting of pairs in one distance restraint: Conservative or Equal
> disre-weighting          = Conservative
> ; Use sqrt of the time averaged times the instantaneous violation
> disre-mixed              = no
> disre-fc                 = 1000
> disre-tau                = 0
> ; Output frequency for pair distances to energy file
> nstdisreout              = 100
> ; Orientation restraints: No or Yes
> orire                    = no
> ; Orientation restraints force constant and tau for time averaging
> orire-fc                 = 0
> orire-tau                = 0
> orire-fitgrp             =
> ; Output frequency for trace(SD) and S to energy file
> nstorireout              = 100
> 
> ; Free energy variables
> free-energy              = no
> couple-moltype           =
> couple-lambda0           = vdw-q
> couple-lambda1           = vdw-q
> couple-intramol          = no
> init-lambda              = -1
> init-lambda-state        = -1
> delta-lambda             = 0
> nstdhdl                  = 50
> fep-lambdas              =
> mass-lambdas             =
> coul-lambdas             =
> vdw-lambdas              =
> bonded-lambdas           =
> restraint-lambdas        =
> temperature-lambdas      =
> calc-lambda-neighbors    = 1
> init-lambda-weights      =
> dhdl-print-energy        = no
> sc-alpha                 = 0
> sc-power                 = 1
> sc-r-power               = 6
> sc-sigma                 = 0.3
> sc-coul                  = no
> separate-dhdl-file       = yes
> dhdl-derivatives         = yes
> dh_hist_size             = 0
> dh_hist_spacing          = 0.1
> 
> ; Non-equilibrium MD stuff
> acc-grps                 =
> accelerate               =
> freezegrps               =
> freezedim                =
> cos-acceleration         = 0
> deform                   =
> 
> ; simulated tempering variables
> simulated-tempering      = no
> simulated-tempering-scaling = geometric
> sim-temp-low             = 300
> sim-temp-high            = 300
> 
> ; Electric fields
> ; Format is number of terms (int) and for all terms an amplitude (real)
> ; and a phase angle (real)
> E-x                      =
> E-xt                     =
> E-y                      =
> E-yt                     =
> E-z                      =
> E-zt                     =
> 
> ; AdResS parameters
> adress                   = no
> 
> ; User defined thingies
> user1-grps               =
> user2-grps               =
> userint1                 = 0
> userint2                 = 0
> userint3                 = 0
> userint4                 = 0
> userreal1                = 0
> userreal2                = 0
> userreal3                = 0
> userreal4                = 0
> 
> 
> The system has 250853 atoms. I used g_tune_pme in order to check the
> performance with different number of processors
> Following are the perf.out for 48 and 160 processors respectively
> 
> Summary of successful runs:
> Line tpr PME nodes  Gcycles Av.  Std.dev.   ns/day   PME/f  DD grid
>   0   0     8        181.713      7.698     0.952    1.334   8  5  1
>   1   0     6        156.720      4.086     1.104    1.420   6  7  1
>   2   0     4        196.320     16.161     0.885    0.916   4 11  1
>   3   0     3        195.312      1.127     0.886    0.840   3  5  3
>   4   0     0        370.539     12.942     0.468      -     8  6  1
>   5   0   -1(  8)    185.688      0.839     0.932    1.322   8  5  1
>   6   1     8        185.651     14.798     0.934    1.294   8  5  1
>   7   1     6        155.970      3.320     1.110    1.157   6  7  1
>   8   1     4        177.021     15.459     0.980    1.005   4 11  1
>   9   1     3        190.704     22.673     0.914    0.931   3  5  3
>  10   1     0        293.676      5.460     0.589      -     8  6  1
>  11   1   -1(  8)    188.978      3.686     0.915    1.266   8  5  1
>  12   2     8        210.631     17.457     0.824    1.176   8  5  1
>  13   2     6        171.926     10.462     1.008    1.186   6  7  1
>  14   2     4        200.015      6.696     0.865    0.839   4 11  1
>  15   2     3        215.013      5.881     0.804    0.863   3  5  3
>  16   2     0        298.363      7.187     0.580      -     8  6  1
>  17   2   -1(  8)    208.821     34.409     0.840    1.088   8  5  1
> 
> ------------------------------------------------------------
> Best performance was achieved with 6 PME nodes (see line 7)
> Optimized PME settings:
>   New Coulomb radius: 1.100000 nm (was 1.000000 nm)
>   New Van der Waals radius: 1.100000 nm (was 1.000000 nm)
>   New Fourier grid xyz: 80 80 80 (was 96 96 96)
> Please use this command line to launch the simulation:
> 
> mpirun -np 48 mdrun_mpi -npme 6 -s tuned.tpr -pin on
> 
> 
> Summary of successful runs:
> Line tpr PME nodes  Gcycles Av.  Std.dev.   ns/day   PME/f  DD grid
>   0   0    25        283.628      2.191     0.610    1.749   5  9  3
>   1   0    20        240.888      9.132     0.719    1.618   5  4  7
>   2   0    16        166.570      0.394     1.038    1.239   8  6  3
>   3   0     0        435.389      3.399     0.397      -    10  8  2
>   4   0   -1( 20)    237.623      6.298     0.729    1.406   5  4  7
>   5   1    25        286.990      1.662     0.603    1.813   5  9  3
>   6   1    20        235.818      0.754     0.734    1.495   5  4  7
>   7   1    16        167.888      3.028     1.030    1.256   8  6  3
>   8   1     0        284.264      3.775     0.609      -     8  5  4
>   9   1   -1( 16)    167.858      1.924     1.030    1.303   8  6  3
>  10   2    25        298.637      1.660     0.579    1.696   5  9  3
>  11   2    20        281.647      1.074     0.614    1.296   5  4  7
>  12   2    16        184.012      4.022     0.941    1.244   8  6  3
>  13   2     0        304.658      0.793     0.568      -     8  5  4
>  14   2   -1( 16)    183.084      2.203     0.945    1.188   8  6  3
> 
> ------------------------------------------------------------
> Best performance was achieved with 16 PME nodes (see line 2)
> and original PME settings.
> Please use this command line to launch the simulation:
> 
> mpirun -np 160 /data1/shashi/localbin/gromacs/bin/mdrun_mpi -npme 16 -s 4icl.tpr -pin on
> 
> 
> Both of these results (1.110 ns/day and 1.038 ns/day) are lower than what I
> get on my workstation with a Xeon W3550 @ 3.07 GHz using 8 threads
> (1.431 ns/day) for a similar system.
> The bench.log files generated by g_tune_pme show very high load imbalance
> (>60 %, up to 100 %). I have tried several combinations of -np and -npme, but
> the performance always stays in this range.
> Can someone please tell me what I am doing wrong, or how I can reduce the
> simulation time?
> -- 
> Regards
> Ashutosh Srivastava


--
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics
Am Fassberg 11, 37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
http://www.mpibpc.mpg.de/grubmueller/kutzner
http://www.mpibpc.mpg.de/grubmueller/sppexa



