[gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs
Christopher Neale
chris.neale at mail.utoronto.ca
Thu Apr 25 16:29:48 CEST 2013
Thank you, Berk.
I am still getting an error when I try an MPI-compiled GROMACS 4.6.1 build with -np set as you suggested.
I ran it like this:
mpirun -np 6 /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi -notunepme -deffnm md3 -dlb yes -npme -1 -cpt 60 -maxh 0.1 -cpi md3.cpt -nsteps 5000000000 -pin on
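For reference, my intention is one PP rank per GPU on each node (3 ranks per node on two nodes = 6 ranks total, with the remaining cores taken up by OpenMP threads). The sketch below is what I think that launch should look like; note that -npernode is an Open MPI option and the 5 OpenMP threads per rank is just my guess for our 16-core nodes (3 ranks x 5 threads = 15 of 16 cores), so please correct me if I have misunderstood:

# hypothetical launch: 2 nodes x 3 GPUs, 3 PP ranks per node, 5 OpenMP threads per rank
mpirun -np 6 -npernode 3 \
  /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi \
  -ntomp 5 -gpu_id 012 -notunepme -deffnm md3 -dlb yes -npme -1 \
  -cpt 60 -maxh 0.1 -cpi md3.cpt -nsteps 5000000000 -pin on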
Here is the .log file output:
Log file opened on Thu Apr 25 10:24:55 2013
Host: kfs064 pid: 38106 nodeid: 0 nnodes: 6
Gromacs version: VERSION 4.6.1
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled
GPU support: enabled
invsqrt routine: gmx_software_invsqrt(x)
CPU acceleration: AVX_256
FFT library: fftw-3.3.3-sse2
Large file support: enabled
RDTSCP usage: enabled
Built on: Tue Apr 23 12:43:12 EDT 2013
Built by: cneale at kfslogin2.nics.utk.edu [CMAKE]
Build OS/arch: Linux 2.6.32-220.4.1.el6.x86_64 x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Build CPU family: 6 Model: 45 Stepping: 7
Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icc Intel icc (ICC) 12.1.5 20120612
C compiler flags: -mavx -std=gnu99 -Wall -ip -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icpc Intel icpc (ICC) 12.1.5 20120612
C++ compiler flags: -mavx -Wall -ip -funroll-all-loops -O3 -DNDEBUG
CUDA compiler: nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2012 NVIDIA Corporation;Built on Thu_Apr__5_00:24:31_PDT_2012;Cuda compilation tools, release 4.2, V0.2.1221
CUDA driver: 5.0
CUDA runtime: 4.20
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 4.6.1 (-:
Contributions from Mark Abraham, Emile Apol, Rossen Apostolov,
Herman J.C. Berendsen, Aldert van Buuren, Pär Bjelkmar,
Rudi van Drunen, Anton Feenstra, Gerrit Groenhof, Christoph Junghans,
Peter Kasson, Carsten Kutzner, Per Larsson, Pieter Meulenhoff,
Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
Michael Shirts, Alfons Sijbers, Peter Tieleman,
Berk Hess, David van der Spoel, and Erik Lindahl.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2012,2013, The GROMACS development team at
Uppsala University & The Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.
:-) /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi (-:
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------
For optimal performance with a GPU nstlist (now 10) should be larger.
The optimum depends on your CPU and GPU resources.
You might want to try several nstlist values.
Can not increase nstlist for GPU run because verlet-buffer-drift is not set or used
Input Parameters:
integrator = sd
nsteps = 5000000
init-step = 0
cutoff-scheme = Verlet
ns_type = Grid
nstlist = 10
ndelta = 2
nstcomm = 100
comm-mode = Linear
nstlog = 0
nstxout = 5000000
nstvout = 5000000
nstfout = 5000000
nstcalcenergy = 100
nstenergy = 50000
nstxtcout = 50000
init-t = 0
delta-t = 0.002
xtcprec = 1000
fourierspacing = 0.12
nkx = 64
nky = 64
nkz = 80
pme-order = 4
ewald-rtol = 1e-05
ewald-geometry = 0
epsilon-surface = 0
optimize-fft = TRUE
ePBC = xyz
bPeriodicMols = FALSE
bContinuation = FALSE
bShakeSOR = FALSE
etc = No
bPrintNHChains = FALSE
nsttcouple = -1
epc = Berendsen
epctype = Semiisotropic
nstpcouple = 10
tau-p = 4
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
compress (3x3):
compress[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compress[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compress[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
verlet-buffer-drift = -1
rlist = 1
rlistlong = 1
nstcalclr = 10
rtpi = 0.05
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1
vdwtype = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 1
epsilon-r = 1
epsilon-rf = inf
tabext = 1
implicit-solvent = No
gb-algorithm = Still
gb-epsilon-solvent = 80
nstgbradii = 1
rgbradii = 1
gb-saltconc = 0
gb-obc-alpha = 1
gb-obc-beta = 0.8
gb-obc-gamma = 4.85
gb-dielectric-offset = 0.009
sa-algorithm = Ace-approximation
sa-surface-tension = 2.05016
DispCorr = EnerPres
bSimTemp = FALSE
free-energy = no
nwall = 0
wall-type = 9-3
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = no
rotation = FALSE
disre = No
disre-weighting = Conservative
disre-mixed = FALSE
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orires-fc = 0
orires-tau = 0
nstorireout = 100
dihre-fc = 0
em-stepsize = 0.01
em-tol = 10
niter = 20
fc-stepsize = 0
nstcgsteep = 1000
nbfgscorr = 10
ConstAlg = Lincs
shake-tol = 0.0001
lincs-order = 6
lincs-warnangle = 30
lincs-iter = 1
bd-fric = 0
ld-seed = 29660
cos-accel = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
adress = FALSE
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
grpopts:
nrdf: 106748
ref-t: 310
tau-t: 1
anneal: No
ann-npoints: 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0
efield-x:
n = 0
efield-xt:
n = 0
efield-y:
n = 0
efield-yt:
n = 0
efield-z:
n = 0
efield-zt:
n = 0
bQMMM = FALSE
QMconstraints = 0
QMMMscheme = 0
scalefactor = 1
qm-opts:
ngQM = 0
Overriding nsteps with value passed on the command line: 705032704 steps, 1410065.408 ps
Initializing Domain Decomposition on 6 nodes
Dynamic load balancing: yes
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
two-body bonded interactions: 0.431 nm, LJ-14, atoms 101 108
multi-body bonded interactions: 0.431 nm, Proper Dih., atoms 101 108
Minimum cell size due to bonded interactions: 0.475 nm
Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.175 nm
Estimated maximum distance required for P-LINCS: 1.175 nm
This distance will limit the DD cell size, you can override this with -rcon
Using 0 separate PME nodes, per user request
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 6 cells with a minimum initial size of 1.469 nm
The maximum allowed number of cells is: X 5 Y 5 Z 6
Domain decomposition grid 3 x 1 x 2, separate PME nodes 0
PME domain decomposition: 6 x 1 x 1
Domain decomposition nodeid 0, coordinates 0 0 0
Using 6 MPI processes
Using 2 OpenMP threads per MPI process
Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: GenuineIntel
Brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Family: 6 Model: 45 Stepping: 7
Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Acceleration most likely to fit this hardware: AVX_256
Acceleration selected at GROMACS compile time: AVX_256
3 GPUs detected on host kfs064:
#0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
#1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
#2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6.1
Source code file: /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/source/src/gmxlib/gmx_detect_hardware.c, line: 356
Fatal error:
Incorrect launch configuration: mismatching number of PP MPI processes and GPUs per node.
mdrun_mpi was started with 6 PP MPI processes per node, but only 3 GPUs were detected.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
###################################################
And here is the stderr output:
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 4.6.1 (-:
Contributions from Mark Abraham, Emile Apol, Rossen Apostolov,
Herman J.C. Berendsen, Aldert van Buuren, Pär Bjelkmar,
Rudi van Drunen, Anton Feenstra, Gerrit Groenhof, Christoph Junghans,
Peter Kasson, Carsten Kutzner, Per Larsson, Pieter Meulenhoff,
Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
Michael Shirts, Alfons Sijbers, Peter Tieleman,
Berk Hess, David van der Spoel, and Erik Lindahl.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2012,2013, The GROMACS development team at
Uppsala University & The Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.
:-) /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi (-:
Option Filename Type Description
------------------------------------------------------------
-s md3.tpr Input Run input file: tpr tpb tpa
-o md3.trr Output Full precision trajectory: trr trj cpt
-x md3.xtc Output, Opt. Compressed trajectory (portable xdr format)
-cpi md3.cpt Input, Opt! Checkpoint file
-cpo md3.cpt Output, Opt. Checkpoint file
-c md3.gro Output Structure file: gro g96 pdb etc.
-e md3.edr Output Energy file
-g md3.log Output Log file
-dhdl md3.xvg Output, Opt. xvgr/xmgr file
-field md3.xvg Output, Opt. xvgr/xmgr file
-table md3.xvg Input, Opt. xvgr/xmgr file
-tabletf md3.xvg Input, Opt. xvgr/xmgr file
-tablep md3.xvg Input, Opt. xvgr/xmgr file
-tableb md3.xvg Input, Opt. xvgr/xmgr file
-rerun md3.xtc Input, Opt. Trajectory: xtc trr trj gro g96 pdb cpt
-tpi md3.xvg Output, Opt. xvgr/xmgr file
-tpid md3.xvg Output, Opt. xvgr/xmgr file
-ei md3.edi Input, Opt. ED sampling input
-eo md3.xvg Output, Opt. xvgr/xmgr file
-j md3.gct Input, Opt. General coupling stuff
-jo md3.gct Output, Opt. General coupling stuff
-ffout md3.xvg Output, Opt. xvgr/xmgr file
-devout md3.xvg Output, Opt. xvgr/xmgr file
-runav md3.xvg Output, Opt. xvgr/xmgr file
-px md3.xvg Output, Opt. xvgr/xmgr file
-pf md3.xvg Output, Opt. xvgr/xmgr file
-ro md3.xvg Output, Opt. xvgr/xmgr file
-ra md3.log Output, Opt. Log file
-rs md3.log Output, Opt. Log file
-rt md3.log Output, Opt. Log file
-mtx md3.mtx Output, Opt. Hessian matrix
-dn md3.ndx Output, Opt. Index file
-multidir md3 Input, Opt., Mult. Run directory
-membed md3.dat Input, Opt. Generic data file
-mp md3.top Input, Opt. Topology file
-mn md3.ndx Input, Opt. Index file
Option Type Value Description
------------------------------------------------------
-[no]h bool no Print help info and quit
-[no]version bool no Print version info and quit
-nice int 0 Set the nicelevel
-deffnm string md3 Set the default filename for all file options
-xvg enum xmgrace xvg plot formatting: xmgrace, xmgr or none
-[no]pd bool no Use particle decompostion
-dd vector 0 0 0 Domain decomposition grid, 0 is optimize
-ddorder enum interleave DD node order: interleave, pp_pme or cartesian
-npme int -1 Number of separate nodes to be used for PME, -1
is guess
-nt int 0 Total number of threads to start (0 is guess)
-ntmpi int 0 Number of thread-MPI threads to start (0 is guess)
-ntomp int 0 Number of OpenMP threads per MPI process/thread
to start (0 is guess)
-ntomp_pme int 0 Number of OpenMP threads per MPI process/thread
to start (0 is -ntomp)
-pin enum on Fix threads (or processes) to specific cores:
auto, on or off
-pinoffset int 0 The starting logical core number for pinning to
cores; used to avoid pinning threads from
different mdrun instances to the same core
-pinstride int 0 Pinning distance in logical cores for threads,
use 0 to minimize the number of threads per
physical core
-gpu_id string List of GPU id's to use
-[no]ddcheck bool yes Check for all bonded interactions with DD
-rdd real 0 The maximum distance for bonded interactions with
DD (nm), 0 is determine from initial coordinates
-rcon real 0 Maximum distance for P-LINCS (nm), 0 is estimate
-dlb enum yes Dynamic load balancing (with DD): auto, no or yes
-dds real 0.8 Minimum allowed dlb scaling of the DD cell size
-gcom int -1 Global communication frequency
-nb enum auto Calculate non-bonded interactions on: auto, cpu,
gpu or gpu_cpu
-[no]tunepme bool no Optimize PME load between PP/PME nodes or GPU/CPU
-[no]testverlet bool no Test the Verlet non-bonded scheme
-[no]v bool no Be loud and noisy
-[no]compact bool yes Write a compact log file
-[no]seppot bool no Write separate V and dVdl terms for each
interaction type and node to the log file(s)
-pforce real -1 Print all forces larger than this (kJ/mol nm)
-[no]reprod bool no Try to avoid optimizations that affect binary
reproducibility
-cpt real 60 Checkpoint interval (minutes)
-[no]cpnum bool no Keep and number checkpoint files
-[no]append bool yes Append to previous output files when continuing
from checkpoint instead of adding the simulation
part number to all file names
-nsteps int 705032704 Run this number of steps, overrides .mdp file
option
-maxh real 0.1 Terminate after 0.99 times this time (hours)
-multi int 0 Do multiple simulations in parallel
-replex int 0 Attempt replica exchange periodically with this
period (steps)
-nex int 0 Number of random exchanges to carry out each
exchange interval (N^3 is one suggestion). -nex
zero or not specified gives neighbor replica
exchange.
-reseed int -1 Seed for replica exchange, -1 is generate a seed
-[no]ionize bool no Do a simulation including the effect of an X-Ray
bombardment on your system
Reading file md3.tpr, VERSION 4.6.1 (single precision)
Can not increase nstlist for GPU run because verlet-buffer-drift is not set or used
Overriding nsteps with value passed on the command line: 705032704 steps, 1410065.408 ps
Using 6 MPI processes
Using 2 OpenMP threads per MPI process
3 GPUs detected on host kfs064:
#0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
#1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
#2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6.1
Source code file: /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/source/src/gmxlib/gmx_detect_hardware.c, line: 356
Fatal error:
Incorrect launch configuration: mismatching number of PP MPI processes and GPUs per node.
mdrun_mpi was started with 6 PP MPI processes per node, but only 3 GPUs were detected.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 0, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 0 out of 6
gcq#6: Thanx for Using GROMACS - Have a Nice Day
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 38106 on
node kfs064 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
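From the fatal error above (6 PP MPI processes per node, but only 3 GPUs detected on kfs064), my guess is that all six ranks are being placed on, or at least counted against, a single node rather than spread over two. If it helps narrow this down, I could also run a single-node check where the rank count matches the GPU count on that host, along the lines of the sketch below (again untested, and the thread count is only a guess):

# hypothetical single-node check: 3 PP ranks on one 3-GPU host, one rank per GPU
mpirun -np 3 /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi \
  -ntomp 5 -gpu_id 012 -deffnm md3 -maxh 0.1 -pin on

My real question, though, is how to get the multi-node launch to place only 3 PP ranks on each node.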
Thank you very much for your help,
Chris.