[gmx-users] performance issue with the parallel implementation of gromacs
ashutosh srivastava
ashu4487 at gmail.com
Thu Sep 19 16:06:43 CEST 2013
Thank you, Carsten.
I will surely try out the suggestions and get back to you.
On Thu, Sep 19, 2013 at 1:52 PM, Carsten Kutzner <ckutzne at gwdg.de> wrote:
> Hi,
>
> first make a scaling test on a single node only, so you can estimate the
> maximum performance you can expect when going to more nodes.
>
> On a single node, you can also run with Gromacs' built-in thread-MPI, thus
> eliminating the possibility that something is wrong with your MPI
> installation.
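>
> A minimal sketch of such a single-node thread-MPI benchmark (the thread
> count, .tpr and output names are placeholders, adjust them to your setup):
>
>   # 12 thread-MPI ranks on one node, with thread pinning enabled
>   mdrun -ntmpi 12 -pin on -s 4icl.tpr -deffnm singlenode_test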
>
> There are lots of reasons why your parallel performance could be bad.
> Can you check that the Infiniband interconnect is actually being used and
> not the Ethernet? It could also be that a stray process is still running
> on one of your cores and eating up CPU time. Or maybe the pinning of
> threads to cores is not correct (what does md.log say about that?).
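>
> For the last two points, something along these lines might help (the exact
> md.log wording differs between versions, so treat the grep pattern only as
> a starting point):
>
>   # check a compute node for leftover processes eating CPU time
>   top -b -n 1 | head -n 30
>   # see what mdrun reported about thread pinning / affinity
>   grep -iE 'pinning|affinity' md.log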
>
> Just a few ideas.
>
> Good luck!
>
> Carsten
>
>
> On Sep 19, 2013, at 8:07 AM, ashutosh srivastava <ashu4487 at gmail.com> wrote:
>
> > Hi
> >
> > I have been trying to run a simulation on a cluster consisting of 24 nodes
> > with Intel(R) Xeon(R) CPU X5670 @ 2.93GHz. Each node has 12 processors, and
> > the nodes are connected via 1 Gbit Ethernet and an Infiniband interconnect.
> > The batch system is TORQUE. However, due to some issues with the parallel
> > queue I have been trying to run the simulations directly on the cluster
> > using mpdboot and mpirun.
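> >
> > The launch itself is roughly of this form (the hostfile name and counts
> > are just placeholders):
> >
> >   mpdboot -n 24 -f mpd.hosts
> >   mpirun -np 48 mdrun_mpi -s 4icl.tpr -pin on
> >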
> > Following is the mdp.out file that I am using for the simulation:
> > ; VARIOUS PREPROCESSING OPTIONS
> > ; Preprocessor information: use cpp syntax.
> > ; e.g.: -I/home/joe/doe -I/home/mary/roe
> > include =
> > ; e.g.: -DPOSRES -DFLEXIBLE (note these variable names are case sensitive)
> > define = -DPOSRES
> >
> > ; RUN CONTROL PARAMETERS
> > integrator = md
> > ; Start time and timestep in ps
> > tinit = 0
> > dt = 0.002
> > nsteps = 250000
> > ; For exact run continuation or redoing part of a run
> > init-step = 0
> > ; Part index is updated automatically on checkpointing (keeps files separate)
> > simulation-part = 1
> > ; mode for center of mass motion removal
> > comm-mode = Linear
> > ; number of steps for center of mass motion removal
> > nstcomm = 100
> > ; group(s) for center of mass motion removal
> > comm-grps =
> >
> > ; LANGEVIN DYNAMICS OPTIONS
> > ; Friction coefficient (amu/ps) and random seed
> > bd-fric = 0
> > ld-seed = 1993
> >
> > ; ENERGY MINIMIZATION OPTIONS
> > ; Force tolerance and initial step-size
> > emtol = 10
> > emstep = 0.01
> > ; Max number of iterations in relax-shells
> > niter = 20
> > ; Step size (ps^2) for minimization of flexible constraints
> > fcstep = 0
> > ; Frequency of steepest descents steps when doing CG
> > nstcgsteep = 1000
> > nbfgscorr = 10
> >
> > ; TEST PARTICLE INSERTION OPTIONS
> > rtpi = 0.05
> >
> > ; OUTPUT CONTROL OPTIONS
> > ; Output frequency for coords (x), velocities (v) and forces (f)
> > nstxout = 100
> > nstvout = 100
> > nstfout = 0
> > ; Output frequency for energies to log file and energy file
> > nstlog = 100
> > nstcalcenergy = 100
> > nstenergy = 100
> > ; Output frequency and precision for .xtc file
> > nstxtcout = 0
> > xtc-precision = 1000
> > ; This selects the subset of atoms for the .xtc file. You can
> > ; select multiple groups. By default all atoms will be written.
> > xtc-grps =
> > ; Selection of energy groups
> > energygrps =
> >
> > ; NEIGHBORSEARCHING PARAMETERS
> > ; cut-off scheme (group: using charge groups, Verlet: particle based cut-offs)
> > cutoff-scheme = Group
> > ; nblist update frequency
> > nstlist = 5
> > ; ns algorithm (simple or grid)
> > ns_type = grid
> > ; Periodic boundary conditions: xyz, no, xy
> > pbc = xyz
> > periodic-molecules = no
> > ; Allowed energy drift due to the Verlet buffer in kJ/mol/ps per atom,
> > ; a value of -1 means: use rlist
> > verlet-buffer-drift = 0.005
> > ; nblist cut-off
> > rlist = 1.0
> > ; long-range cut-off for switched potentials
> > rlistlong = -1
> > nstcalclr = -1
> >
> > ; OPTIONS FOR ELECTROSTATICS AND VDW
> > ; Method for doing electrostatics
> > coulombtype = PME
> > coulomb-modifier = Potential-shift-Verlet
> > rcoulomb-switch = 0
> > rcoulomb = 1.0
> > ; Relative dielectric constant for the medium and the reaction field
> > epsilon-r = 1
> > epsilon-rf = 0
> > ; Method for doing Van der Waals
> > vdw-type = Cut-off
> > vdw-modifier = Potential-shift-Verlet
> > ; cut-off lengths
> > rvdw-switch = 0
> > rvdw = 1.0
> > ; Apply long range dispersion corrections for Energy and Pressure
> > DispCorr = EnerPres
> > ; Extension of the potential lookup tables beyond the cut-off
> > table-extension = 1
> > ; Separate tables between energy group pairs
> > energygrp-table =
> > ; Spacing for the PME/PPPM FFT grid
> > fourierspacing = 0.16
> > ; FFT grid size, when a value is 0 fourierspacing will be used
> > fourier-nx = 0
> > fourier-ny = 0
> > fourier-nz = 0
> > ; EWALD/PME/PPPM parameters
> > pme_order = 4
> > ewald-rtol = 1e-05
> > ewald-geometry = 3d
> > epsilon-surface = 0
> > optimize-fft = no
> >
> > ; IMPLICIT SOLVENT ALGORITHM
> > implicit-solvent = No
> >
> > ; GENERALIZED BORN ELECTROSTATICS
> > ; Algorithm for calculating Born radii
> > gb-algorithm = Still
> > ; Frequency of calculating the Born radii inside rlist
> > nstgbradii = 1
> > ; Cutoff for Born radii calculation; the contribution from atoms
> > ; between rlist and rgbradii is updated every nstlist steps
> > rgbradii = 1
> > ; Dielectric coefficient of the implicit solvent
> > gb-epsilon-solvent = 80
> > ; Salt concentration in M for Generalized Born models
> > gb-saltconc = 0
> > ; Scaling factors used in the OBC GB model. Default values are OBC(II)
> > gb-obc-alpha = 1
> > gb-obc-beta = 0.8
> > gb-obc-gamma = 4.85
> > gb-dielectric-offset = 0.009
> > sa-algorithm = Ace-approximation
> > ; Surface tension (kJ/mol/nm^2) for the SA (nonpolar surface) part of GBSA
> > ; The value -1 will set default value for Still/HCT/OBC GB-models.
> > sa-surface-tension = -1
> >
> > ; OPTIONS FOR WEAK COUPLING ALGORITHMS
> > ; Temperature coupling
> > tcoupl = V-rescale
> > nsttcouple = -1
> > nh-chain-length = 10
> > print-nose-hoover-chain-variables = no
> > ; Groups to couple separately
> > tc-grps = Protein Non-Protein
> > ; Time constant (ps) and reference temperature (K)
> > tau_t = 0.1 0.1
> > ref_t = 300 300
> > ; pressure coupling
> > pcoupl = no
> > pcoupltype = Isotropic
> > nstpcouple = -1
> > ; Time constant (ps), compressibility (1/bar) and reference P (bar)
> > tau-p = 1
> > compressibility =
> > ref-p =
> > ; Scaling of reference coordinates, No, All or COM
> > refcoord-scaling = No
> >
> > ; OPTIONS FOR QMMM calculations
> > QMMM = no
> > ; Groups treated Quantum Mechanically
> > QMMM-grps =
> > ; QM method
> > QMmethod =
> > ; QMMM scheme
> > QMMMscheme = normal
> > ; QM basisset
> > QMbasis =
> > ; QM charge
> > QMcharge =
> > ; QM multiplicity
> > QMmult =
> > ; Surface Hopping
> > SH =
> > ; CAS space options
> > CASorbitals =
> > CASelectrons =
> > SAon =
> > SAoff =
> > SAsteps =
> > ; Scale factor for MM charges
> > MMChargeScaleFactor = 1
> > ; Optimization of QM subsystem
> > bOPT =
> > bTS =
> >
> > ; SIMULATED ANNEALING
> > ; Type of annealing for each temperature group (no/single/periodic)
> > annealing =
> > ; Number of time points to use for specifying annealing in each group
> > annealing-npoints =
> > ; List of times at the annealing points for each group
> > annealing-time =
> > ; Temp. at each annealing point, for each group.
> > annealing-temp =
> >
> > ; GENERATE VELOCITIES FOR STARTUP RUN
> > gen_vel = yes
> > gen_temp = 300
> > gen_seed = -1
> >
> > ; OPTIONS FOR BONDS
> > constraints = all-bonds
> > ; Type of constraint algorithm
> > constraint_algorithm = lincs
> > ; Do not constrain the start configuration
> > continuation = no
> > ; Use successive overrelaxation to reduce the number of shake iterations
> > Shake-SOR = no
> > ; Relative tolerance of shake
> > shake-tol = 0.0001
> > ; Highest order in the expansion of the constraint coupling matrix
> > lincs_order = 4
> > ; Number of iterations in the final step of LINCS. 1 is fine for
> > ; normal simulations, but use 2 to conserve energy in NVE runs.
> > ; For energy minimization with constraints it should be 4 to 8.
> > lincs_iter = 1
> > ; Lincs will write a warning to the stderr if in one step a bond
> > ; rotates over more degrees than
> > lincs-warnangle = 30
> > ; Convert harmonic bonds to morse potentials
> > morse = no
> >
> > ; ENERGY GROUP EXCLUSIONS
> > ; Pairs of energy groups for which all non-bonded interactions are excluded
> > energygrp-excl =
> >
> > ; WALLS
> > ; Number of walls, type, atom types, densities and box-z scale factor for Ewald
> > nwall = 0
> > wall-type = 9-3
> > wall-r-linpot = -1
> > wall-atomtype =
> > wall-density =
> > wall-ewald-zfac = 3
> >
> > ; COM PULLING
> > ; Pull type: no, umbrella, constraint or constant-force
> > pull = no
> >
> > ; ENFORCED ROTATION
> > ; Enforced rotation: No or Yes
> > rotation = no
> >
> > ; NMR refinement stuff
> > ; Distance restraints type: No, Simple or Ensemble
> > disre = No
> > ; Force weighting of pairs in one distance restraint: Conservative or Equal
> > disre-weighting = Conservative
> > ; Use sqrt of the time averaged times the instantaneous violation
> > disre-mixed = no
> > disre-fc = 1000
> > disre-tau = 0
> > ; Output frequency for pair distances to energy file
> > nstdisreout = 100
> > ; Orientation restraints: No or Yes
> > orire = no
> > ; Orientation restraints force constant and tau for time averaging
> > orire-fc = 0
> > orire-tau = 0
> > orire-fitgrp =
> > ; Output frequency for trace(SD) and S to energy file
> > nstorireout = 100
> >
> > ; Free energy variables
> > free-energy = no
> > couple-moltype =
> > couple-lambda0 = vdw-q
> > couple-lambda1 = vdw-q
> > couple-intramol = no
> > init-lambda = -1
> > init-lambda-state = -1
> > delta-lambda = 0
> > nstdhdl = 50
> > fep-lambdas =
> > mass-lambdas =
> > coul-lambdas =
> > vdw-lambdas =
> > bonded-lambdas =
> > restraint-lambdas =
> > temperature-lambdas =
> > calc-lambda-neighbors = 1
> > init-lambda-weights =
> > dhdl-print-energy = no
> > sc-alpha = 0
> > sc-power = 1
> > sc-r-power = 6
> > sc-sigma = 0.3
> > sc-coul = no
> > separate-dhdl-file = yes
> > dhdl-derivatives = yes
> > dh_hist_size = 0
> > dh_hist_spacing = 0.1
> >
> > ; Non-equilibrium MD stuff
> > acc-grps =
> > accelerate =
> > freezegrps =
> > freezedim =
> > cos-acceleration = 0
> > deform =
> >
> > ; simulated tempering variables
> > simulated-tempering = no
> > simulated-tempering-scaling = geometric
> > sim-temp-low = 300
> > sim-temp-high = 300
> >
> > ; Electric fields
> > ; Format is number of terms (int) and for all terms an amplitude (real)
> > ; and a phase angle (real)
> > E-x =
> > E-xt =
> > E-y =
> > E-yt =
> > E-z =
> > E-zt =
> >
> > ; AdResS parameters
> > adress = no
> >
> > ; User defined thingies
> > user1-grps =
> > user2-grps =
> > userint1 = 0
> > userint2 = 0
> > userint3 = 0
> > userint4 = 0
> > userreal1 = 0
> > userreal2 = 0
> > userreal3 = 0
> > userreal4 = 0
> >
> >
> > The system has 250853 atoms. I used g_tune_pme to check the performance
> > with different numbers of processors.
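> >
> > The tuning call was roughly of this form (only a sketch; the exact command
> > lines recommended by the tool are quoted further below):
> >
> >   g_tune_pme -np 48 -s 4icl.tpr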
> >
> > Following are the perf.out files for 48 and 160 processors, respectively:
> >
> > Summary of successful runs:
> >  Line  tpr  PME nodes  Gcycles Av.  Std.dev.  ns/day  PME/f  DD grid
> >     0    0          8      181.713     7.698   0.952  1.334   8  5  1
> >     1    0          6      156.720     4.086   1.104  1.420   6  7  1
> >     2    0          4      196.320    16.161   0.885  0.916   4 11  1
> >     3    0          3      195.312     1.127   0.886  0.840   3  5  3
> >     4    0          0      370.539    12.942   0.468      -   8  6  1
> >     5    0     -1( 8)      185.688     0.839   0.932  1.322   8  5  1
> >     6    1          8      185.651    14.798   0.934  1.294   8  5  1
> >     7    1          6      155.970     3.320   1.110  1.157   6  7  1
> >     8    1          4      177.021    15.459   0.980  1.005   4 11  1
> >     9    1          3      190.704    22.673   0.914  0.931   3  5  3
> >    10    1          0      293.676     5.460   0.589      -   8  6  1
> >    11    1     -1( 8)      188.978     3.686   0.915  1.266   8  5  1
> >    12    2          8      210.631    17.457   0.824  1.176   8  5  1
> >    13    2          6      171.926    10.462   1.008  1.186   6  7  1
> >    14    2          4      200.015     6.696   0.865  0.839   4 11  1
> >    15    2          3      215.013     5.881   0.804  0.863   3  5  3
> >    16    2          0      298.363     7.187   0.580      -   8  6  1
> >    17    2     -1( 8)      208.821    34.409   0.840  1.088   8  5  1
> >
> > ------------------------------------------------------------
> > Best performance was achieved with 6 PME nodes (see line 7)
> > Optimized PME settings:
> > New Coulomb radius: 1.100000 nm (was 1.000000 nm)
> > New Van der Waals radius: 1.100000 nm (was 1.000000 nm)
> > New Fourier grid xyz: 80 80 80 (was 96 96 96)
> > Please use this command line to launch the simulation:
> >
> > mpirun -np 48 mdrun_mpi -npme 6 -s tuned.tpr -pin on
> >
> >
> > Summary of successful runs:
> >  Line  tpr  PME nodes  Gcycles Av.  Std.dev.  ns/day  PME/f  DD grid
> >     0    0         25      283.628     2.191   0.610  1.749   5  9  3
> >     1    0         20      240.888     9.132   0.719  1.618   5  4  7
> >     2    0         16      166.570     0.394   1.038  1.239   8  6  3
> >     3    0          0      435.389     3.399   0.397      -  10  8  2
> >     4    0    -1( 20)      237.623     6.298   0.729  1.406   5  4  7
> >     5    1         25      286.990     1.662   0.603  1.813   5  9  3
> >     6    1         20      235.818     0.754   0.734  1.495   5  4  7
> >     7    1         16      167.888     3.028   1.030  1.256   8  6  3
> >     8    1          0      284.264     3.775   0.609      -   8  5  4
> >     9    1    -1( 16)      167.858     1.924   1.030  1.303   8  6  3
> >    10    2         25      298.637     1.660   0.579  1.696   5  9  3
> >    11    2         20      281.647     1.074   0.614  1.296   5  4  7
> >    12    2         16      184.012     4.022   0.941  1.244   8  6  3
> >    13    2          0      304.658     0.793   0.568      -   8  5  4
> >    14    2    -1( 16)      183.084     2.203   0.945  1.188   8  6  3
> >
> > ------------------------------------------------------------
> > Best performance was achieved with 16 PME nodes (see line 2)
> > and original PME settings.
> > Please use this command line to launch the simulation:
> >
> > mpirun -np 160 /data1/shashi/localbin/gromacs/bin/mdrun_mpi -npme 16 -s 4icl.tpr -pin on
> >
> >
> > Both of these results (1.110 ns/day and 1.038 ns/day) are lower than what I
> > get on my workstation with a Xeon W3550 @ 3.07 GHz using 8 threads
> > (1.431 ns/day) for a similar system.
> > The bench.log file generated by g_tune_pme shows very high load imbalance
> > (>60-100%). I have tried several combinations of -np and -npme, but the
> > performance always stays in this range.
> > Can someone please tell me what I am doing wrong, or how I can decrease the
> > simulation time?
> > --
> > Regards
> > Ashutosh Srivastava
>
>
> --
> Dr. Carsten Kutzner
> Max Planck Institute for Biophysical Chemistry
> Theoretical and Computational Biophysics
> Am Fassberg 11, 37077 Goettingen, Germany
> Tel. +49-551-2012313, Fax: +49-551-2012302
> http://www.mpibpc.mpg.de/grubmueller/kutzner
> http://www.mpibpc.mpg.de/grubmueller/sppexa
>
--
Regards
Ashutosh Srivastava