[gmx-users] performance issue with the parallel implementation of gromacs
ashutosh srivastava
ashu4487 at gmail.com
Thu Sep 19 16:06:43 CEST 2013
Thank you, Carsten.
I will surely try out the suggestions and get back to you.
On Thu, Sep 19, 2013 at 1:52 PM, Carsten Kutzner <ckutzne at gwdg.de> wrote:
> Hi,
>
> first make a scaling test on a single node only, so you can estimate the
> maximum performance you can expect when going to more nodes.
>
> On a single node, you can also run with Gromacs' built-in thread-MPI, thus
> eliminating the possibility that something is wrong with your MPI
> installation.
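>
> A minimal sketch of such a single-node thread-MPI benchmark (the thread
> count, .tpr and output names are placeholders, adjust them to your setup):
>
>   # 12 thread-MPI ranks on one node, with thread pinning enabled
>   mdrun -ntmpi 12 -pin on -s 4icl.tpr -deffnm singlenode_test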
>
> There are lots of reasons why your parallel performance could be bad.
> Can you check that the Infiniband interconnect is actually being used and
> not the Ethernet? It could also be that a stray process is still running
> on one of your cores and eating up CPU time. Or maybe the pinning of
> threads to cores is not correct (what does md.log say about that?).
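>
> For the last two points, something along these lines might help (the exact
> md.log wording differs between versions, so treat the grep pattern only as
> a starting point):
>
>   # check a compute node for leftover processes eating CPU time
>   top -b -n 1 | head -n 30
>   # see what mdrun reported about thread pinning / affinity
>   grep -iE 'pinning|affinity' md.log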
>
> Just a few ideas.
>
> Good luck!
>
> Carsten
>
>
> On Sep 19, 2013, at 8:07 AM, ashutosh srivastava <ashu4487 at gmail.com> wrote:
>
> > Hi
> >
> > I have been trying to run a simulation on a cluster consisting of 24 nodes
> > with Intel(R) Xeon(R) CPU X5670 @ 2.93GHz. Each node has 12 processors, and
> > the nodes are connected via 1 Gbit Ethernet and an Infiniband interconnect.
> > The batch system is TORQUE. However, due to some issues with the parallel
> > queue I have been trying to run the simulations directly on the cluster
> > using mpdboot and mpirun.
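> >
> > The launch itself is roughly of this form (the hostfile name and counts
> > are just placeholders):
> >
> >   mpdboot -n 24 -f mpd.hosts
> >   mpirun -np 48 mdrun_mpi -s 4icl.tpr -pin on
> >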
> > Following is the mdp.out file that I am using for the simulation:
> > ; VARIOUS PREPROCESSING OPTIONS
> > ; Preprocessor information: use cpp syntax.
> > ; e.g.: -I/home/joe/doe -I/home/mary/roe
> > include =
> > ; e.g.: -DPOSRES -DFLEXIBLE (note these variable names are case sensitive)
> > define = -DPOSRES
> >
> > ; RUN CONTROL PARAMETERS
> > integrator = md
> > ; Start time and timestep in ps
> > tinit = 0
> > dt = 0.002
> > nsteps = 250000
> > ; For exact run continuation or redoing part of a run
> > init-step = 0
> > ; Part index is updated automatically on checkpointing (keeps files separate)
> > simulation-part = 1
> > ; mode for center of mass motion removal
> > comm-mode = Linear
> > ; number of steps for center of mass motion removal
> > nstcomm = 100
> > ; group(s) for center of mass motion removal
> > comm-grps =
> >
> > ; LANGEVIN DYNAMICS OPTIONS
> > ; Friction coefficient (amu/ps) and random seed
> > bd-fric = 0
> > ld-seed = 1993
> >
> > ; ENERGY MINIMIZATION OPTIONS
> > ; Force tolerance and initial step-size
> > emtol = 10
> > emstep = 0.01
> > ; Max number of iterations in relax-shells
> > niter = 20
> > ; Step size (ps^2) for minimization of flexible constraints
> > fcstep = 0
> > ; Frequency of steepest descents steps when doing CG
> > nstcgsteep = 1000
> > nbfgscorr = 10
> >
> > ; TEST PARTICLE INSERTION OPTIONS
> > rtpi = 0.05
> >
> > ; OUTPUT CONTROL OPTIONS
> > ; Output frequency for coords (x), velocities (v) and forces (f)
> > nstxout = 100
> > nstvout = 100
> > nstfout = 0
> > ; Output frequency for energies to log file and energy file
> > nstlog = 100
> > nstcalcenergy = 100
> > nstenergy = 100
> > ; Output frequency and precision for .xtc file
> > nstxtcout = 0
> > xtc-precision = 1000
> > ; This selects the subset of atoms for the .xtc file. You can
> > ; select multiple groups. By default all atoms will be written.
> > xtc-grps =
> > ; Selection of energy groups
> > energygrps =
> >
> > ; NEIGHBORSEARCHING PARAMETERS
> > ; cut-off scheme (group: using charge groups, Verlet: particle based cut-offs)
> > cutoff-scheme = Group
> > ; nblist update frequency
> > nstlist = 5
> > ; ns algorithm (simple or grid)
> > ns_type = grid
> > ; Periodic boundary conditions: xyz, no, xy
> > pbc = xyz
> > periodic-molecules = no
> > ; Allowed energy drift due to the Verlet buffer in kJ/mol/ps per atom,
> > ; a value of -1 means: use rlist
> > verlet-buffer-drift = 0.005
> > ; nblist cut-off
> > rlist = 1.0
> > ; long-range cut-off for switched potentials
> > rlistlong = -1
> > nstcalclr = -1
> >
> > ; OPTIONS FOR ELECTROSTATICS AND VDW
> > ; Method for doing electrostatics
> > coulombtype = PME
> > coulomb-modifier = Potential-shift-Verlet
> > rcoulomb-switch = 0
> > rcoulomb = 1.0
> > ; Relative dielectric constant for the medium and the reaction field
> > epsilon-r = 1
> > epsilon-rf = 0
> > ; Method for doing Van der Waals
> > vdw-type = Cut-off
> > vdw-modifier = Potential-shift-Verlet
> > ; cut-off lengths
> > rvdw-switch = 0
> > rvdw = 1.0
> > ; Apply long range dispersion corrections for Energy and Pressure
> > DispCorr = EnerPres
> > ; Extension of the potential lookup tables beyond the cut-off
> > table-extension = 1
> > ; Separate tables between energy group pairs
> > energygrp-table =
> > ; Spacing for the PME/PPPM FFT grid
> > fourierspacing = 0.16
> > ; FFT grid size, when a value is 0 fourierspacing will be used
> > fourier-nx = 0
> > fourier-ny = 0
> > fourier-nz = 0
> > ; EWALD/PME/PPPM parameters
> > pme_order = 4
> > ewald-rtol = 1e-05
> > ewald-geometry = 3d
> > epsilon-surface = 0
> > optimize-fft = no
> >
> > ; IMPLICIT SOLVENT ALGORITHM
> > implicit-solvent = No
> >
> > ; GENERALIZED BORN ELECTROSTATICS
> > ; Algorithm for calculating Born radii
> > gb-algorithm = Still
> > ; Frequency of calculating the Born radii inside rlist
> > nstgbradii = 1
> > ; Cutoff for Born radii calculation; the contribution from atoms
> > ; between rlist and rgbradii is updated every nstlist steps
> > rgbradii = 1
> > ; Dielectric coefficient of the implicit solvent
> > gb-epsilon-solvent = 80
> > ; Salt concentration in M for Generalized Born models
> > gb-saltconc = 0
> > ; Scaling factors used in the OBC GB model. Default values are OBC(II)
> > gb-obc-alpha = 1
> > gb-obc-beta = 0.8
> > gb-obc-gamma = 4.85
> > gb-dielectric-offset = 0.009
> > sa-algorithm = Ace-approximation
> > ; Surface tension (kJ/mol/nm^2) for the SA (nonpolar surface) part of GBSA
> > ; The value -1 will set default value for Still/HCT/OBC GB-models.
> > sa-surface-tension = -1
> >
> > ; OPTIONS FOR WEAK COUPLING ALGORITHMS
> > ; Temperature coupling
> > tcoupl = V-rescale
> > nsttcouple = -1
> > nh-chain-length = 10
> > print-nose-hoover-chain-variables = no
> > ; Groups to couple separately
> > tc-grps = Protein Non-Protein
> > ; Time constant (ps) and reference temperature (K)
> > tau_t = 0.1 0.1
> > ref_t = 300 300
> > ; pressure coupling
> > pcoupl = no
> > pcoupltype = Isotropic
> > nstpcouple = -1
> > ; Time constant (ps), compressibility (1/bar) and reference P (bar)
> > tau-p = 1
> > compressibility =
> > ref-p =
> > ; Scaling of reference coordinates, No, All or COM
> > refcoord-scaling = No
> >
> > ; OPTIONS FOR QMMM calculations
> > QMMM = no
> > ; Groups treated Quantum Mechanically
> > QMMM-grps =
> > ; QM method
> > QMmethod =
> > ; QMMM scheme
> > QMMMscheme = normal
> > ; QM basisset
> > QMbasis =
> > ; QM charge
> > QMcharge =
> > ; QM multiplicity
> > QMmult =
> > ; Surface Hopping
> > SH =
> > ; CAS space options
> > CASorbitals =
> > CASelectrons =
> > SAon =
> > SAoff =
> > SAsteps =
> > ; Scale factor for MM charges
> > MMChargeScaleFactor = 1
> > ; Optimization of QM subsystem
> > bOPT =
> > bTS =
> >
> > ; SIMULATED ANNEALING
> > ; Type of annealing for each temperature group (no/single/periodic)
> > annealing =
> > ; Number of time points to use for specifying annealing in each group
> > annealing-npoints =
> > ; List of times at the annealing points for each group
> > annealing-time =
> > ; Temp. at each annealing point, for each group.
> > annealing-temp =
> >
> > ; GENERATE VELOCITIES FOR STARTUP RUN
> > gen_vel = yes
> > gen_temp = 300
> > gen_seed = -1
> >
> > ; OPTIONS FOR BONDS
> > constraints = all-bonds
> > ; Type of constraint algorithm
> > constraint_algorithm = lincs
> > ; Do not constrain the start configuration
> > continuation = no
> > ; Use successive overrelaxation to reduce the number of shake iterations
> > Shake-SOR = no
> > ; Relative tolerance of shake
> > shake-tol = 0.0001
> > ; Highest order in the expansion of the constraint coupling matrix
> > lincs_order = 4
> > ; Number of iterations in the final step of LINCS. 1 is fine for
> > ; normal simulations, but use 2 to conserve energy in NVE runs.
> > ; For energy minimization with constraints it should be 4 to 8.
> > lincs_iter = 1
> > ; Lincs will write a warning to the stderr if in one step a bond
> > ; rotates over more degrees than
> > lincs-warnangle = 30
> > ; Convert harmonic bonds to morse potentials
> > morse = no
> >
> > ; ENERGY GROUP EXCLUSIONS
> > ; Pairs of energy groups for which all non-bonded interactions are excluded
> > energygrp-excl =
> >
> > ; WALLS
> > ; Number of walls, type, atom types, densities and box-z scale factor for Ewald
> > nwall = 0
> > wall-type = 9-3
> > wall-r-linpot = -1
> > wall-atomtype =
> > wall-density =
> > wall-ewald-zfac = 3
> >
> > ; COM PULLING
> > ; Pull type: no, umbrella, constraint or constant-force
> > pull = no
> >
> > ; ENFORCED ROTATION
> > ; Enforced rotation: No or Yes
> > rotation = no
> >
> > ; NMR refinement stuff
> > ; Distance restraints type: No, Simple or Ensemble
> > disre = No
> > ; Force weighting of pairs in one distance restraint: Conservative or Equal
> > disre-weighting = Conservative
> > ; Use sqrt of the time averaged times the instantaneous violation
> > disre-mixed = no
> > disre-fc = 1000
> > disre-tau = 0
> > ; Output frequency for pair distances to energy file
> > nstdisreout = 100
> > ; Orientation restraints: No or Yes
> > orire = no
> > ; Orientation restraints force constant and tau for time averaging
> > orire-fc = 0
> > orire-tau = 0
> > orire-fitgrp =
> > ; Output frequency for trace(SD) and S to energy file
> > nstorireout = 100
> >
> > ; Free energy variables
> > free-energy = no
> > couple-moltype =
> > couple-lambda0 = vdw-q
> > couple-lambda1 = vdw-q
> > couple-intramol = no
> > init-lambda = -1
> > init-lambda-state = -1
> > delta-lambda = 0
> > nstdhdl = 50
> > fep-lambdas =
> > mass-lambdas =
> > coul-lambdas =
> > vdw-lambdas =
> > bonded-lambdas =
> > restraint-lambdas =
> > temperature-lambdas =
> > calc-lambda-neighbors = 1
> > init-lambda-weights =
> > dhdl-print-energy = no
> > sc-alpha = 0
> > sc-power = 1
> > sc-r-power = 6
> > sc-sigma = 0.3
> > sc-coul = no
> > separate-dhdl-file = yes
> > dhdl-derivatives = yes
> > dh_hist_size = 0
> > dh_hist_spacing = 0.1
> >
> > ; Non-equilibrium MD stuff
> > acc-grps =
> > accelerate =
> > freezegrps =
> > freezedim =
> > cos-acceleration = 0
> > deform =
> >
> > ; simulated tempering variables
> > simulated-tempering = no
> > simulated-tempering-scaling = geometric
> > sim-temp-low = 300
> > sim-temp-high = 300
> >
> > ; Electric fields
> > ; Format is number of terms (int) and for all terms an amplitude (real)
> > ; and a phase angle (real)
> > E-x =
> > E-xt =
> > E-y =
> > E-yt =
> > E-z =
> > E-zt =
> >
> > ; AdResS parameters
> > adress = no
> >
> > ; User defined thingies
> > user1-grps =
> > user2-grps =
> > userint1 = 0
> > userint2 = 0
> > userint3 = 0
> > userint4 = 0
> > userreal1 = 0
> > userreal2 = 0
> > userreal3 = 0
> > userreal4 = 0
> >
> >
> > The system has 250853 atoms. I used g_tune_pme to check the performance
> > with different numbers of processors.
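> >
> > The tuning call was roughly of this form (only a sketch; the exact command
> > lines recommended by the tool are quoted further below):
> >
> >   g_tune_pme -np 48 -s 4icl.tpr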
> >
> > Following are the perf.out files for 48 and 160 processors, respectively:
> >
> > Summary of successful runs:
> >  Line  tpr  PME nodes  Gcycles Av.  Std.dev.  ns/day  PME/f  DD grid
> >     0    0          8      181.713     7.698   0.952  1.334   8  5  1
> >     1    0          6      156.720     4.086   1.104  1.420   6  7  1
> >     2    0          4      196.320    16.161   0.885  0.916   4 11  1
> >     3    0          3      195.312     1.127   0.886  0.840   3  5  3
> >     4    0          0      370.539    12.942   0.468      -   8  6  1
> >     5    0     -1( 8)      185.688     0.839   0.932  1.322   8  5  1
> >     6    1          8      185.651    14.798   0.934  1.294   8  5  1
> >     7    1          6      155.970     3.320   1.110  1.157   6  7  1
> >     8    1          4      177.021    15.459   0.980  1.005   4 11  1
> >     9    1          3      190.704    22.673   0.914  0.931   3  5  3
> >    10    1          0      293.676     5.460   0.589      -   8  6  1
> >    11    1     -1( 8)      188.978     3.686   0.915  1.266   8  5  1
> >    12    2          8      210.631    17.457   0.824  1.176   8  5  1
> >    13    2          6      171.926    10.462   1.008  1.186   6  7  1
> >    14    2          4      200.015     6.696   0.865  0.839   4 11  1
> >    15    2          3      215.013     5.881   0.804  0.863   3  5  3
> >    16    2          0      298.363     7.187   0.580      -   8  6  1
> >    17    2     -1( 8)      208.821    34.409   0.840  1.088   8  5  1
> >
> > ------------------------------------------------------------
> > Best performance was achieved with 6 PME nodes (see line 7)
> > Optimized PME settings:
> > New Coulomb radius: 1.100000 nm (was 1.000000 nm)
> > New Van der Waals radius: 1.100000 nm (was 1.000000 nm)
> > New Fourier grid xyz: 80 80 80 (was 96 96 96)
> > Please use this command line to launch the simulation:
> >
> > mpirun -np 48 mdrun_mpi -npme 6 -s tuned.tpr -pin on
> >
> >
> > Summary of successful runs:
> >  Line  tpr  PME nodes  Gcycles Av.  Std.dev.  ns/day  PME/f  DD grid
> >     0    0         25      283.628     2.191   0.610  1.749   5  9  3
> >     1    0         20      240.888     9.132   0.719  1.618   5  4  7
> >     2    0         16      166.570     0.394   1.038  1.239   8  6  3
> >     3    0          0      435.389     3.399   0.397      -  10  8  2
> >     4    0    -1( 20)      237.623     6.298   0.729  1.406   5  4  7
> >     5    1         25      286.990     1.662   0.603  1.813   5  9  3
> >     6    1         20      235.818     0.754   0.734  1.495   5  4  7
> >     7    1         16      167.888     3.028   1.030  1.256   8  6  3
> >     8    1          0      284.264     3.775   0.609      -   8  5  4
> >     9    1    -1( 16)      167.858     1.924   1.030  1.303   8  6  3
> >    10    2         25      298.637     1.660   0.579  1.696   5  9  3
> >    11    2         20      281.647     1.074   0.614  1.296   5  4  7
> >    12    2         16      184.012     4.022   0.941  1.244   8  6  3
> >    13    2          0      304.658     0.793   0.568      -   8  5  4
> >    14    2    -1( 16)      183.084     2.203   0.945  1.188   8  6  3
> >
> > ------------------------------------------------------------
> > Best performance was achieved with 16 PME nodes (see line 2)
> > and original PME settings.
> > Please use this command line to launch the simulation:
> >
> > mpirun -np 160 /data1/shashi/localbin/gromacs/bin/mdrun_mpi -npme 16 -s 4icl.tpr -pin on
> >
> >
> > Both of these results (1.110 ns/day and 1.038 ns/day) are lower than what I
> > get on my workstation with a Xeon W3550 @ 3.07 GHz using 8 threads
> > (1.431 ns/day) for a similar system.
> > The bench.log file generated by g_tune_pme shows very high load imbalance
> > (>60-100%). I have tried several combinations of -np and -npme, but the
> > performance always stays in this range.
> > Can someone please tell me what I am doing wrong, or how I can decrease the
> > simulation time?
> > --
> > Regards
> > Ashutosh Srivastava
>
>
> --
> Dr. Carsten Kutzner
> Max Planck Institute for Biophysical Chemistry
> Theoretical and Computational Biophysics
> Am Fassberg 11, 37077 Goettingen, Germany
> Tel. +49-551-2012313, Fax: +49-551-2012302
> http://www.mpibpc.mpg.de/grubmueller/kutzner
> http://www.mpibpc.mpg.de/grubmueller/sppexa
>
--
Regards
Ashutosh Srivastava