[gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs
Christopher Neale
chris.neale at mail.utoronto.ca
Thu Apr 25 16:29:48 CEST 2013
Thank you, Berk.
I am still getting an error when I try an MPI-compiled GROMACS 4.6.1 build with -np set as you suggested.
I ran it like this:
mpirun -np 6 /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi -notunepme -deffnm md3 -dlb yes -npme -1 -cpt 60 -maxh 0.1 -cpi md3.cpt -nsteps 5000000000 -pin on
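For reference, my intention is one PP rank per GPU on each node (3 ranks per node on two nodes = 6 ranks total, with the remaining cores taken up by OpenMP threads). The sketch below is what I think that launch should look like; note that -npernode is an Open MPI option and the 5 OpenMP threads per rank is just my guess for our 16-core nodes (3 ranks x 5 threads = 15 of 16 cores), so please correct me if I have misunderstood:

# hypothetical launch: 2 nodes x 3 GPUs, 3 PP ranks per node, 5 OpenMP threads per rank
mpirun -np 6 -npernode 3 \
  /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi \
  -ntomp 5 -gpu_id 012 -notunepme -deffnm md3 -dlb yes -npme -1 \
  -cpt 60 -maxh 0.1 -cpi md3.cpt -nsteps 5000000000 -pin on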
Here is the .log file output:
Log file opened on Thu Apr 25 10:24:55 2013
Host: kfs064 pid: 38106 nodeid: 0 nnodes: 6
Gromacs version: VERSION 4.6.1
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled
GPU support: enabled
invsqrt routine: gmx_software_invsqrt(x)
CPU acceleration: AVX_256
FFT library: fftw-3.3.3-sse2
Large file support: enabled
RDTSCP usage: enabled
Built on: Tue Apr 23 12:43:12 EDT 2013
Built by: cneale at kfslogin2.nics.utk.edu [CMAKE]
Build OS/arch: Linux 2.6.32-220.4.1.el6.x86_64 x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Build CPU family: 6 Model: 45 Stepping: 7
Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icc Intel icc (ICC) 12.1.5 20120612
C compiler flags: -mavx -std=gnu99 -Wall -ip -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icpc Intel icpc (ICC) 12.1.5 20120612
C++ compiler flags: -mavx -Wall -ip -funroll-all-loops -O3 -DNDEBUG
CUDA compiler: nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2012 NVIDIA Corporation;Built on Thu_Apr__5_00:24:31_PDT_2012;Cuda compilation tools, release 4.2, V0.2.1221
CUDA driver: 5.0
CUDA runtime: 4.20
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 4.6.1 (-:
Contributions from Mark Abraham, Emile Apol, Rossen Apostolov,
Herman J.C. Berendsen, Aldert van Buuren, Pär Bjelkmar,
Rudi van Drunen, Anton Feenstra, Gerrit Groenhof, Christoph Junghans,
Peter Kasson, Carsten Kutzner, Per Larsson, Pieter Meulenhoff,
Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
Michael Shirts, Alfons Sijbers, Peter Tieleman,
Berk Hess, David van der Spoel, and Erik Lindahl.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2012,2013, The GROMACS development team at
Uppsala University & The Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.
:-) /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi (-:
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------
For optimal performance with a GPU nstlist (now 10) should be larger.
The optimum depends on your CPU and GPU resources.
You might want to try several nstlist values.
Can not increase nstlist for GPU run because verlet-buffer-drift is not set or used
Input Parameters:
integrator = sd
nsteps = 5000000
init-step = 0
cutoff-scheme = Verlet
ns_type = Grid
nstlist = 10
ndelta = 2
nstcomm = 100
comm-mode = Linear
nstlog = 0
nstxout = 5000000
nstvout = 5000000
nstfout = 5000000
nstcalcenergy = 100
nstenergy = 50000
nstxtcout = 50000
init-t = 0
delta-t = 0.002
xtcprec = 1000
fourierspacing = 0.12
nkx = 64
nky = 64
nkz = 80
pme-order = 4
ewald-rtol = 1e-05
ewald-geometry = 0
epsilon-surface = 0
optimize-fft = TRUE
ePBC = xyz
bPeriodicMols = FALSE
bContinuation = FALSE
bShakeSOR = FALSE
etc = No
bPrintNHChains = FALSE
nsttcouple = -1
epc = Berendsen
epctype = Semiisotropic
nstpcouple = 10
tau-p = 4
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
compress (3x3):
compress[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compress[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compress[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
verlet-buffer-drift = -1
rlist = 1
rlistlong = 1
nstcalclr = 10
rtpi = 0.05
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1
vdwtype = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 1
epsilon-r = 1
epsilon-rf = inf
tabext = 1
implicit-solvent = No
gb-algorithm = Still
gb-epsilon-solvent = 80
nstgbradii = 1
rgbradii = 1
gb-saltconc = 0
gb-obc-alpha = 1
gb-obc-beta = 0.8
gb-obc-gamma = 4.85
gb-dielectric-offset = 0.009
sa-algorithm = Ace-approximation
sa-surface-tension = 2.05016
DispCorr = EnerPres
bSimTemp = FALSE
free-energy = no
nwall = 0
wall-type = 9-3
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = no
rotation = FALSE
disre = No
disre-weighting = Conservative
disre-mixed = FALSE
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orires-fc = 0
orires-tau = 0
nstorireout = 100
dihre-fc = 0
em-stepsize = 0.01
em-tol = 10
niter = 20
fc-stepsize = 0
nstcgsteep = 1000
nbfgscorr = 10
ConstAlg = Lincs
shake-tol = 0.0001
lincs-order = 6
lincs-warnangle = 30
lincs-iter = 1
bd-fric = 0
ld-seed = 29660
cos-accel = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
adress = FALSE
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
grpopts:
nrdf: 106748
ref-t: 310
tau-t: 1
anneal: No
ann-npoints: 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0
efield-x:
n = 0
efield-xt:
n = 0
efield-y:
n = 0
efield-yt:
n = 0
efield-z:
n = 0
efield-zt:
n = 0
bQMMM = FALSE
QMconstraints = 0
QMMMscheme = 0
scalefactor = 1
qm-opts:
ngQM = 0
Overriding nsteps with value passed on the command line: 705032704 steps, 1410065.408 ps
Initializing Domain Decomposition on 6 nodes
Dynamic load balancing: yes
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
two-body bonded interactions: 0.431 nm, LJ-14, atoms 101 108
multi-body bonded interactions: 0.431 nm, Proper Dih., atoms 101 108
Minimum cell size due to bonded interactions: 0.475 nm
Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.175 nm
Estimated maximum distance required for P-LINCS: 1.175 nm
This distance will limit the DD cell size, you can override this with -rcon
Using 0 separate PME nodes, per user request
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 6 cells with a minimum initial size of 1.469 nm
The maximum allowed number of cells is: X 5 Y 5 Z 6
Domain decomposition grid 3 x 1 x 2, separate PME nodes 0
PME domain decomposition: 6 x 1 x 1
Domain decomposition nodeid 0, coordinates 0 0 0
Using 6 MPI processes
Using 2 OpenMP threads per MPI process
Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: GenuineIntel
Brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Family: 6 Model: 45 Stepping: 7
Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Acceleration most likely to fit this hardware: AVX_256
Acceleration selected at GROMACS compile time: AVX_256
3 GPUs detected on host kfs064:
#0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
#1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
#2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6.1
Source code file: /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/source/src/gmxlib/gmx_detect_hardware.c, line: 356
Fatal error:
Incorrect launch configuration: mismatching number of PP MPI processes and GPUs per node.
mdrun_mpi was started with 6 PP MPI processes per node, but only 3 GPUs were detected.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
###################################################
And here is the stderr output:
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 4.6.1 (-:
Contributions from Mark Abraham, Emile Apol, Rossen Apostolov,
Herman J.C. Berendsen, Aldert van Buuren, Pär Bjelkmar,
Rudi van Drunen, Anton Feenstra, Gerrit Groenhof, Christoph Junghans,
Peter Kasson, Carsten Kutzner, Per Larsson, Pieter Meulenhoff,
Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
Michael Shirts, Alfons Sijbers, Peter Tieleman,
Berk Hess, David van der Spoel, and Erik Lindahl.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2012,2013, The GROMACS development team at
Uppsala University & The Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.
:-) /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi (-:
Option Filename Type Description
------------------------------------------------------------
-s md3.tpr Input Run input file: tpr tpb tpa
-o md3.trr Output Full precision trajectory: trr trj cpt
-x md3.xtc Output, Opt. Compressed trajectory (portable xdr format)
-cpi md3.cpt Input, Opt! Checkpoint file
-cpo md3.cpt Output, Opt. Checkpoint file
-c md3.gro Output Structure file: gro g96 pdb etc.
-e md3.edr Output Energy file
-g md3.log Output Log file
-dhdl md3.xvg Output, Opt. xvgr/xmgr file
-field md3.xvg Output, Opt. xvgr/xmgr file
-table md3.xvg Input, Opt. xvgr/xmgr file
-tabletf md3.xvg Input, Opt. xvgr/xmgr file
-tablep md3.xvg Input, Opt. xvgr/xmgr file
-tableb md3.xvg Input, Opt. xvgr/xmgr file
-rerun md3.xtc Input, Opt. Trajectory: xtc trr trj gro g96 pdb cpt
-tpi md3.xvg Output, Opt. xvgr/xmgr file
-tpid md3.xvg Output, Opt. xvgr/xmgr file
-ei md3.edi Input, Opt. ED sampling input
-eo md3.xvg Output, Opt. xvgr/xmgr file
-j md3.gct Input, Opt. General coupling stuff
-jo md3.gct Output, Opt. General coupling stuff
-ffout md3.xvg Output, Opt. xvgr/xmgr file
-devout md3.xvg Output, Opt. xvgr/xmgr file
-runav md3.xvg Output, Opt. xvgr/xmgr file
-px md3.xvg Output, Opt. xvgr/xmgr file
-pf md3.xvg Output, Opt. xvgr/xmgr file
-ro md3.xvg Output, Opt. xvgr/xmgr file
-ra md3.log Output, Opt. Log file
-rs md3.log Output, Opt. Log file
-rt md3.log Output, Opt. Log file
-mtx md3.mtx Output, Opt. Hessian matrix
-dn md3.ndx Output, Opt. Index file
-multidir md3 Input, Opt., Mult. Run directory
-membed md3.dat Input, Opt. Generic data file
-mp md3.top Input, Opt. Topology file
-mn md3.ndx Input, Opt. Index file
Option Type Value Description
------------------------------------------------------
-[no]h bool no Print help info and quit
-[no]version bool no Print version info and quit
-nice int 0 Set the nicelevel
-deffnm string md3 Set the default filename for all file options
-xvg enum xmgrace xvg plot formatting: xmgrace, xmgr or none
-[no]pd bool no Use particle decompostion
-dd vector 0 0 0 Domain decomposition grid, 0 is optimize
-ddorder enum interleave DD node order: interleave, pp_pme or cartesian
-npme int -1 Number of separate nodes to be used for PME, -1
is guess
-nt int 0 Total number of threads to start (0 is guess)
-ntmpi int 0 Number of thread-MPI threads to start (0 is guess)
-ntomp int 0 Number of OpenMP threads per MPI process/thread
to start (0 is guess)
-ntomp_pme int 0 Number of OpenMP threads per MPI process/thread
to start (0 is -ntomp)
-pin enum on Fix threads (or processes) to specific cores:
auto, on or off
-pinoffset int 0 The starting logical core number for pinning to
cores; used to avoid pinning threads from
different mdrun instances to the same core
-pinstride int 0 Pinning distance in logical cores for threads,
use 0 to minimize the number of threads per
physical core
-gpu_id string List of GPU id's to use
-[no]ddcheck bool yes Check for all bonded interactions with DD
-rdd real 0 The maximum distance for bonded interactions with
DD (nm), 0 is determine from initial coordinates
-rcon real 0 Maximum distance for P-LINCS (nm), 0 is estimate
-dlb enum yes Dynamic load balancing (with DD): auto, no or yes
-dds real 0.8 Minimum allowed dlb scaling of the DD cell size
-gcom int -1 Global communication frequency
-nb enum auto Calculate non-bonded interactions on: auto, cpu,
gpu or gpu_cpu
-[no]tunepme bool no Optimize PME load between PP/PME nodes or GPU/CPU
-[no]testverlet bool no Test the Verlet non-bonded scheme
-[no]v bool no Be loud and noisy
-[no]compact bool yes Write a compact log file
-[no]seppot bool no Write separate V and dVdl terms for each
interaction type and node to the log file(s)
-pforce real -1 Print all forces larger than this (kJ/mol nm)
-[no]reprod bool no Try to avoid optimizations that affect binary
reproducibility
-cpt real 60 Checkpoint interval (minutes)
-[no]cpnum bool no Keep and number checkpoint files
-[no]append bool yes Append to previous output files when continuing
from checkpoint instead of adding the simulation
part number to all file names
-nsteps int 705032704 Run this number of steps, overrides .mdp file
option
-maxh real 0.1 Terminate after 0.99 times this time (hours)
-multi int 0 Do multiple simulations in parallel
-replex int 0 Attempt replica exchange periodically with this
period (steps)
-nex int 0 Number of random exchanges to carry out each
exchange interval (N^3 is one suggestion). -nex
zero or not specified gives neighbor replica
exchange.
-reseed int -1 Seed for replica exchange, -1 is generate a seed
-[no]ionize bool no Do a simulation including the effect of an X-Ray
bombardment on your system
Reading file md3.tpr, VERSION 4.6.1 (single precision)
Can not increase nstlist for GPU run because verlet-buffer-drift is not set or used
Overriding nsteps with value passed on the command line: 705032704 steps, 1410065.408 ps
Using 6 MPI processes
Using 2 OpenMP threads per MPI process
3 GPUs detected on host kfs064:
#0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
#1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
#2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6.1
Source code file: /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/source/src/gmxlib/gmx_detect_hardware.c, line: 356
Fatal error:
Incorrect launch configuration: mismatching number of PP MPI processes and GPUs per node.
mdrun_mpi was started with 6 PP MPI processes per node, but only 3 GPUs were detected.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 0, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 0 out of 6
gcq#6: Thanx for Using GROMACS - Have a Nice Day
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 38106 on
node kfs064 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
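From the fatal error above (6 PP MPI processes per node, but only 3 GPUs detected on kfs064), my guess is that all six ranks are being placed on, or at least counted against, a single node rather than spread over two. If it helps narrow this down, I could also run a single-node check where the rank count matches the GPU count on that host, along the lines of the sketch below (again untested, and the thread count is only a guess):

# hypothetical single-node check: 3 PP ranks on one 3-GPU host, one rank per GPU
mpirun -np 3 /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi \
  -ntomp 5 -gpu_id 012 -deffnm md3 -maxh 0.1 -pin on

My real question, though, is how to get the multi-node launch to place only 3 PP ranks on each node.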
Thank you very much for your help,
Chris.