[gmx-users] How to redirect the calculation load toward GPU
Dario Corrada
dario.corrada at gmail.com
Tue Apr 8 17:19:00 CEST 2014
Below is a chunk of the log file:
Log file opened on Tue Apr 8 09:57:06 2014
Host: Obsidian03 pid: 10221 nodeid: 0 nnodes: 1
Gromacs version: VERSION 4.6.5
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled
GPU support: enabled
invsqrt routine: gmx_software_invsqrt(x)
CPU acceleration: SSE2
FFT library: fftw-3.3.3-sse2
Large file support: enabled
RDTSCP usage: enabled
Built on: Thu Feb 13 17:01:44 CET 2014
Built by: portage at Chlorine01 [CMAKE]
Build OS/arch: Linux 3.10.17-gentoo-Generic-x64 x86_64
Build CPU vendor: AuthenticAMD
Build CPU brand: AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
Build CPU family: 15 Model: 107 Stepping: 2
Build CPU features: apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr pse rdtscp sse2 sse3
C compiler: /usr/bin/x86_64-pc-linux-gnu-gcc GNU
x86_64-pc-linux-gnu-gcc (Gentoo 4.7.3-r1 p1.4, pie-0.5.5) 4.7.3
C compiler flags: -msse2 -Wextra -Wno-missing-field-initializers
-Wno-sign-compare -Wall -Wno-unused -Wunused-value -march=native -O2
-pipe -fomit-frame-pointer
C++ compiler: /usr/bin/x86_64-pc-linux-gnu-g++ GNU
x86_64-pc-linux-gnu-g++ (Gentoo 4.7.3-r1 p1.4, pie-0.5.5) 4.7.3
C++ compiler flags: -msse2 -Wextra -Wno-missing-field-initializers
-Wno-sign-compare -Wall -Wno-unused -Wunused-value -march=native -O2
-pipe -fomit-frame-pointer
CUDA compiler: /opt/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on
Wed_Jul_17_18:36:13_PDT_2013;Cuda compilation tools, release 5.5, V5.5.0
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;-Xcompiler;-fPIC
;
-msse2;-Wextra;-Wno-missing-field-initializers;-Wno-sign-compare;-Wall;-Wno-unused;-Wunused-value;-march=native;-O2;-pipe;-fomit-frame-pointer;
CUDA driver: 6.0
CUDA runtime: 5.50
:-) G R O M A C S (-:
Glycine aRginine prOline Methionine Alanine Cystine Serine
:-) VERSION 4.6.5 (-:
Contributions from Mark Abraham, Emile Apol, Rossen Apostolov,
Herman J.C. Berendsen, Aldert van Buuren, Pär Bjelkmar,
Rudi van Drunen, Anton Feenstra, Gerrit Groenhof, Christoph Junghans,
Peter Kasson, Carsten Kutzner, Per Larsson, Pieter Meulenhoff,
Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
Michael Shirts, Alfons Sijbers, Peter Tieleman,
Berk Hess, David van der Spoel, and Erik Lindahl.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2012,2013, The GROMACS development team at
Uppsala University & The Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.
:-) mdrun (-:
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------
For optimal performance with a GPU nstlist (now 5) should be larger.
The optimum depends on your CPU and GPU resources.
You might want to try several nstlist values.
Changing nstlist from 5 to 40, rlist from 0.9 to 0.999
Input Parameters:
integrator = md
nsteps = 500000
init-step = 0
cutoff-scheme = Verlet
ns_type = Grid
nstlist = 40
ndelta = 2
nstcomm = 100
comm-mode = Linear
nstlog = 500
nstxout = 10000
nstvout = 10000
nstfout = 10000
nstcalcenergy = 100
nstenergy = 2000
nstxtcout = 500
init-t = 0
delta-t = 0.002
xtcprec = 1000
fourierspacing = 0.12
nkx = 64
nky = 72
nkz = 56
pme-order = 4
ewald-rtol = 1e-05
ewald-geometry = 0
epsilon-surface = 0
optimize-fft = TRUE
ePBC = xyz
bPeriodicMols = FALSE
bContinuation = TRUE
bShakeSOR = FALSE
etc = Berendsen
bPrintNHChains = FALSE
nsttcouple = 5
epc = Berendsen
epctype = Isotropic
nstpcouple = 5
tau-p = 1
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
compress (3x3):
compress[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compress[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compress[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
verlet-buffer-drift = 0.005
rlist = 0.999
rlistlong = 0.999
nstcalclr = 5
rtpi = 0.05
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 0.9
vdwtype = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 0.9
epsilon-r = 1
epsilon-rf = inf
tabext = 1
implicit-solvent = No
gb-algorithm = Still
gb-epsilon-solvent = 80
nstgbradii = 1
rgbradii = 1
gb-saltconc = 0
gb-obc-alpha = 1
gb-obc-beta = 0.8
gb-obc-gamma = 4.85
gb-dielectric-offset = 0.009
sa-algorithm = Ace-approximation
sa-surface-tension = 2.05016
DispCorr = No
bSimTemp = FALSE
free-energy = no
nwall = 0
wall-type = 9-3
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = no
rotation = FALSE
disre = No
disre-weighting = Conservative
disre-mixed = FALSE
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orires-fc = 0
orires-tau = 0
nstorireout = 100
dihre-fc = 0
em-stepsize = 0.01
em-tol = 10
niter = 20
fc-stepsize = 0
nstcgsteep = 1000
nbfgscorr = 10
ConstAlg = Lincs
shake-tol = 0.0001
lincs-order = 4
lincs-warnangle = 30
lincs-iter = 1
bd-fric = 0
ld-seed = 1993
cos-accel = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
adress = FALSE
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
grpopts:
nrdf: 7097.73 71223.3 41.9984 11.9995
ref-t: 300 300 300 300
tau-t: 0.1 0.1 0.1 0.1
anneal: No No No No
ann-npoints: 0 0 0 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0 0 0 0
energygrp-flags[ 1]: 0 0 0 0
energygrp-flags[ 2]: 0 0 0 0
energygrp-flags[ 3]: 0 0 0 0
efield-x:
n = 0
efield-xt:
n = 0
efield-y:
n = 0
efield-yt:
n = 0
efield-z:
n = 0
efield-zt:
n = 0
bQMMM = FALSE
QMconstraints = 0
QMMMscheme = 0
scalefactor = 1
qm-opts:
ngQM = 0
Using 1 MPI thread
Using 1 OpenMP thread
Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: AuthenticAMD
Brand: AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
Family: 15 Model: 107 Stepping: 2
Features: apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr pse rdtscp sse2 sse3
Acceleration most likely to fit this hardware: SSE2
Acceleration selected at GROMACS compile time: SSE2
1 GPU detected:
#0: NVIDIA GeForce GTX 480, compute cap.: 2.0, ECC: no, stat: compatible
1 GPU auto-selected for this run.
Mapping of GPU to the 1 PP rank in this node: #0
Will do PME sum in reciprocal space.
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------
Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
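The Gaussian width reported above follows from `rcoulomb` and `ewald-rtol`. As we understand it, mdrun picks the Ewald splitting parameter beta so that erfc(beta * rc) equals the tolerance. A minimal sketch, assuming a plain bisection on Python's `math.erfc` (our illustration, not GROMACS source), reproduces the logged 1/beta of 0.288146 nm for rc = 0.9 nm and rtol = 1e-05. It also reproduces the 0.403 nm that appears after PP-PME load balancing moved the cut-off to 1.259 nm:

```python
import math


def ewald_beta(rc_nm: float, rtol: float) -> float:
    """Bisect for beta (1/nm) such that erfc(beta * rc) == rtol.

    Sketch only: erfc is monotonically decreasing, so once the bracket
    [0, high] satisfies erfc(high * rc) <= rtol, bisection converges.
    """
    high = 1.0
    while math.erfc(high * rc_nm) > rtol:  # grow the bracket
        high *= 2.0
    low = 0.0
    for _ in range(100):
        mid = 0.5 * (low + high)
        if math.erfc(mid * rc_nm) > rtol:
            low = mid   # tail still too large: beta must grow
        else:
            high = mid  # tail small enough: try a smaller beta
    return 0.5 * (low + high)


# Values from the log: rcoulomb = 0.9 nm, ewald-rtol = 1e-05.
beta_initial = ewald_beta(0.9, 1e-5)
# After PP-PME load balancing the Coulomb cut-off became 1.259 nm.
beta_final = ewald_beta(1.259, 1e-5)

print(f"initial 1/beta = {1 / beta_initial:.6f} nm")
print(f"final   1/beta = {1 / beta_final:.3f} nm")
```

Because erfc(beta * rc) is held fixed, scaling rc by some factor scales 1/beta by the same factor. This is why the tuned setup keeps the same Ewald accuracy.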
Cut-off's: NS: 0.999 Coulomb: 0.9 LJ: 0.9
System total charge: -0.000
Generated table with 999 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 999 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 999 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 999 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 999 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 999 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Using CUDA 8x8 non-bonded kernels
NOTE: With GPUs, reporting energy group contributions is not supported
Potential shift: LJ r^-12: 3.541 r^-6 1.882, Ewald 1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 5.87e-04 size: 1536
Initializing LINear Constraint Solver
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- --- Thank You --- -------- --------
The number of constraints is 3636
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: rest
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, J. P. M. Postma, A. DiNola and J. R. Haak
Molecular dynamics with coupling to an external bath
J. Chem. Phys. 81 (1984) pp. 3684-3690
-------- -------- --- Thank You --- -------- --------
There are: 39209 Atoms
There are: 4 VSites
Initial temperature: 299.37 K
Started mdrun on node 0 Tue Apr 8 09:57:08 2014
           Step           Time         Lambda
              0        0.00000        0.00000
Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
    7.33721e+03    4.51248e+02    3.79061e+03    5.27540e+03    1.62273e+04
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    8.73356e+04   -6.50993e+05    5.62839e+03   -5.24947e+05    9.80531e+04
   Total Energy    Temperature Pressure (bar)   Constr. rmsd
   -4.26894e+05    3.00938e+02    4.75944e+02    1.62529e-04
step 80: timed with pme grid 64 72 56, coulomb cutoff 0.900: 7449.1 M-cycles
step 160: timed with pme grid 60 64 52, coulomb cutoff 0.969: 6643.5 M-cycles
step 240: timed with pme grid 52 56 48, coulomb cutoff 1.066: 5848.3 M-cycles
step 320: timed with pme grid 48 52 44, coulomb cutoff 1.154: 5451.7 M-cycles
step 400: timed with pme grid 44 48 40, coulomb cutoff 1.259: 5155.4 M-cycles
step 480: timed with pme grid 40 44 36, coulomb cutoff 1.399: 5345.2 M-cycles
           Step           Time         Lambda
            500        1.00000        0.00000
Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
    7.42010e+03    4.62095e+02    3.80227e+03    5.25222e+03    1.62479e+04
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    8.70828e+04   -6.46020e+05    8.93429e+02   -5.24859e+05    9.78962e+04
   Total Energy    Temperature Pressure (bar)   Constr. rmsd
   -4.26963e+05    3.00457e+02   -4.08038e+01    1.79131e-04
step 560: timed with pme grid 36 40 32, coulomb cutoff 1.574: 5623.4 M-cycles
step 640: timed with pme grid 32 36 28, coulomb cutoff 1.799: 6059.6 M-cycles
step 720: timed with pme grid 52 52 44, coulomb cutoff 1.146: 5551.0 M-cycles
step 800: timed with pme grid 48 52 44, coulomb cutoff 1.154: 5444.4 M-cycles
step 880: timed with pme grid 48 52 42, coulomb cutoff 1.199: 5416.8 M-cycles
step 960: timed with pme grid 48 48 42, coulomb cutoff 1.241: 5219.8 M-cycles
           Step           Time         Lambda
           1000        2.00000        0.00000
Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
    7.62333e+03    4.07026e+02    3.86990e+03    5.24986e+03    1.62035e+04
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    8.67999e+04   -6.45748e+05    1.69649e+03   -5.23898e+05    9.71380e+04
   Total Energy    Temperature Pressure (bar)   Constr. rmsd
   -4.26760e+05    2.98129e+02    6.34793e+01    1.81889e-04
step 1040: timed with pme grid 44 48 40, coulomb cutoff 1.259: 5147.3 M-cycles
step 1120: timed with pme grid 42 48 40, coulomb cutoff 1.319: 5222.1 M-cycles
step 1200: timed with pme grid 42 44 40, coulomb cutoff 1.354: 5265.5 M-cycles
step 1280: timed with pme grid 40 44 40, coulomb cutoff 1.385: 5323.8 M-cycles
step 1360: timed with pme grid 40 44 36, coulomb cutoff 1.399: 5330.7 M-cycles
step 1440: timed with pme grid 40 42 36, coulomb cutoff 1.418: 5362.1 M-cycles
           Step           Time         Lambda
           1500        3.00000        0.00000
Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
    7.48854e+03    4.01389e+02    3.82475e+03    5.30181e+03    1.62800e+04
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    8.77397e+04   -6.47116e+05    1.04295e+03   -5.25037e+05    9.81473e+04
   Total Energy    Temperature Pressure (bar)   Constr. rmsd
   -4.26890e+05    3.01227e+02    1.97365e+02    1.61429e-04
step 1520: timed with pme grid 40 40 36, coulomb cutoff 1.489: 5479.8 M-cycles
step 1600: timed with pme grid 36 40 36, coulomb cutoff 1.539: 5561.7 M-cycles
step 1680: timed with pme grid 36 40 32, coulomb cutoff 1.574: 5621.6 M-cycles
step 1760: timed with pme grid 36 36 32, coulomb cutoff 1.655: 18446744079486.4 M-cycles
step 1840: timed with pme grid 32 36 32, coulomb cutoff 1.732: 5910.1 M-cycles
optimal pme grid 44 48 40, coulomb cutoff 1.259
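The timing lines above show mdrun's PP-PME load balancing at work. In this run it scans PME grids and Coulomb cut-offs, timing one candidate every 80 steps, and then settles on the fastest. A minimal sketch reproduces the reported optimum, assuming the choice reduces to the lowest measured cost. mdrun's actual logic is more involved; the log shows several grids being timed twice. The step-1760 reading of 18446744079486.4 M-cycles is just above 2^64 cycles, which suggests a counter overflow rather than a real timing, so the sketch discards it:

```python
# (step, PME grid, coulomb cut-off in nm, cost in M-cycles), copied from the log.
TIMINGS = [
    (80, (64, 72, 56), 0.900, 7449.1),
    (160, (60, 64, 52), 0.969, 6643.5),
    (240, (52, 56, 48), 1.066, 5848.3),
    (320, (48, 52, 44), 1.154, 5451.7),
    (400, (44, 48, 40), 1.259, 5155.4),
    (480, (40, 44, 36), 1.399, 5345.2),
    (560, (36, 40, 32), 1.574, 5623.4),
    (640, (32, 36, 28), 1.799, 6059.6),
    (720, (52, 52, 44), 1.146, 5551.0),
    (800, (48, 52, 44), 1.154, 5444.4),
    (880, (48, 52, 42), 1.199, 5416.8),
    (960, (48, 48, 42), 1.241, 5219.8),
    (1040, (44, 48, 40), 1.259, 5147.3),
    (1120, (42, 48, 40), 1.319, 5222.1),
    (1200, (42, 44, 40), 1.354, 5265.5),
    (1280, (40, 44, 40), 1.385, 5323.8),
    (1360, (40, 44, 36), 1.399, 5330.7),
    (1440, (40, 42, 36), 1.418, 5362.1),
    (1520, (40, 40, 36), 1.489, 5479.8),
    (1600, (36, 40, 36), 1.539, 5561.7),
    (1680, (36, 40, 32), 1.574, 5621.6),
    (1760, (36, 36, 32), 1.655, 18446744079486.4),  # bogus reading
    (1840, (32, 36, 32), 1.732, 5910.1),
]

# A 64-bit cycle counter wraps at 2**64 cycles; a reading at or above that
# value (expressed in M-cycles) cannot be a real timing.
WRAP_MCYCLES = 2**64 / 1e6

valid = [t for t in TIMINGS if t[3] < WRAP_MCYCLES]
step, grid, cutoff, cost = min(valid, key=lambda t: t[3])

print(f"optimal pme grid {grid[0]} {grid[1]} {grid[2]}, coulomb cutoff {cutoff:.3f}")
print(f"(fastest timing: {cost} M-cycles at step {step})")
```

Taking the minimum directly, rather than averaging repeated timings of the same grid, is a simplification. Here it happens to agree with the optimum mdrun printed.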
           Step           Time         Lambda
           2000        4.00000        0.00000
Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
    7.15902e+03    4.73998e+02    3.80519e+03    5.28955e+03    1.60984e+04
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    8.67239e+04   -6.46186e+05    1.73056e+03   -5.24905e+05    9.82208e+04
   Total Energy    Temperature Pressure (bar)   Constr. rmsd
   -4.26685e+05    3.01453e+02   -1.32508e+02    1.73835e-04
           Step           Time         Lambda
           2500        5.00000        0.00000
Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
    7.43140e+03    4.52490e+02    3.53021e+03    5.30039e+03    1.62136e+04
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    8.76721e+04   -6.46736e+05    1.70041e+03   -5.24435e+05    9.78989e+04
   Total Energy    Temperature Pressure (bar)   Constr. rmsd
   -4.26536e+05    3.00465e+02    5.56417e+01    1.73350e-04
           Step           Time         Lambda
           3000        6.00000        0.00000
[...]
Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
    7.11055e+03    4.25118e+02    3.46453e+03    5.28507e+03    1.64961e+04
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    8.79810e+04   -6.49054e+05    1.71960e+03   -5.26572e+05    9.86032e+04
   Total Energy    Temperature Pressure (bar)   Constr. rmsd
   -4.27969e+05    3.02626e+02    1.38984e+01    1.74222e-04
           Step           Time         Lambda
         500000     1000.00000        0.00000
Writing checkpoint, step 500000 at Tue Apr 8 16:24:42 2014
Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
    7.13014e+03    4.02446e+02    3.53093e+03    5.31865e+03    1.64910e+04
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    8.59428e+04   -6.46111e+05    1.64510e+03   -5.25650e+05    9.77048e+04
   Total Energy    Temperature Pressure (bar)   Constr. rmsd
   -4.27945e+05    2.99869e+02   -1.28891e+02    1.61440e-04
<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>
Statistics over 500001 steps using 5001 frames
Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
    7.24325e+03    4.20614e+02    3.45246e+03    5.28335e+03    1.63563e+04
        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
    8.70652e+04   -6.47075e+05    1.68233e+03   -5.25571e+05    9.77395e+04
   Total Energy    Temperature Pressure (bar)   Constr. rmsd
   -4.27832e+05    2.99975e+02    2.65498e+00    0.00000e+00
          Box-X          Box-Y          Box-Z
    7.43280e+00    7.99132e+00    6.75655e+00
Total Virial (kJ/mol)
3.24737e+04 -2.66392e+01 -1.46523e+01
-2.63966e+01 3.25516e+04 6.88809e+01
-1.42370e+01 6.90065e+01 3.26181e+04
Pressure (bar)
6.32622e+00 1.69758e+00 -7.49524e-01
1.67749e+00 1.75993e+00 -3.96960e+00
-7.83905e-01 -3.98004e+00 -1.21194e-01
  Epot (kJ/mol)        Coul-SR          LJ-SR        Coul-14          LJ-14
Protein-Protein   -6.47075e+05    8.70652e+04    1.63313e+04    5.22322e+03
    Protein-SOL    0.00000e+00    0.00000e+00    0.00000e+00    0.00000e+00
    Protein-UNK    0.00000e+00    0.00000e+00    0.00000e+00    0.00000e+00
    Protein-Ion    0.00000e+00    0.00000e+00    0.00000e+00    0.00000e+00
        SOL-SOL    0.00000e+00    0.00000e+00    0.00000e+00    0.00000e+00
        SOL-UNK    0.00000e+00    0.00000e+00    0.00000e+00    0.00000e+00
        SOL-Ion    0.00000e+00    0.00000e+00    0.00000e+00    0.00000e+00
        UNK-UNK    0.00000e+00    0.00000e+00    2.49760e+01    6.01314e+01
        UNK-Ion    0.00000e+00    0.00000e+00    0.00000e+00    0.00000e+00
        Ion-Ion    0.00000e+00    0.00000e+00    0.00000e+00    0.00000e+00
      T-Protein          T-SOL          T-UNK          T-Ion
    2.99940e+02    2.99980e+02    2.99534e+02    2.97557e+02
P P - P M E L O A D B A L A N C I N G
PP/PME load balancing changed the cut-off and PME settings:
            particle-particle   PME
            rcoulomb  rlist     grid          spacing   1/beta
initial     0.900 nm  0.999 nm  64 72 56      0.120 nm  0.288 nm
final       1.259 nm  1.358 nm  44 48 40      0.168 nm  0.403 nm
cost-ratio  2.51                0.33
(note that these numbers concern only part of the total PP and PME load)
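The cost ratios in this table can be reproduced with simple scaling arguments, under two assumptions of ours:

- the particle-particle cost scales with the volume of the pair-list sphere, (rlist_final / rlist_initial)^3;
- the PME mesh cost scales with the number of grid points.

The sketch also checks that rcoulomb, the grid spacing and 1/beta were all scaled by roughly the same factor (about 1.4). That keeps the Ewald accuracy constant while shifting work from the CPU (PME mesh) to the GPU (non-bonded pairs):

```python
# Values from the PP/PME load-balancing table in the log.
rc_initial, rc_final = 0.900, 1.259              # rcoulomb (nm)
rlist_initial, rlist_final = 0.999, 1.358        # rlist (nm)
grid_initial, grid_final = (64, 72, 56), (44, 48, 40)
spacing_initial, spacing_final = 0.120, 0.168    # PME grid spacing (nm)
inv_beta_initial, inv_beta_final = 0.288, 0.403  # Ewald 1/beta (nm)


def n_points(grid):
    """Total number of PME grid points."""
    nx, ny, nz = grid
    return nx * ny * nz


# Assumption: PP cost ~ rlist^3 and PME mesh cost ~ number of grid points.
pp_cost_ratio = (rlist_final / rlist_initial) ** 3
pme_cost_ratio = n_points(grid_final) / n_points(grid_initial)

# The three Ewald-related lengths grow by (nearly) the same factor.
scale_factors = (
    rc_final / rc_initial,
    spacing_final / spacing_initial,
    inv_beta_final / inv_beta_initial,
)

print(f"PP cost-ratio  ~ {pp_cost_ratio:.2f}")   # the log reports 2.51
print(f"PME cost-ratio ~ {pme_cost_ratio:.2f}")  # the log reports 0.33
print("scale factors:", [round(s, 2) for s in scale_factors])
```

Note also that the Verlet buffer, rlist - rcoulomb = 0.099 nm, is the same before and after tuning.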
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing:                          M-Number         M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check      84910.270848      764192.438     0.1
NxN QSTab Elec. + VdW [F]    25223069.779264  1034145860.950    95.9
NxN QSTab Elec. + VdW [V&F]    254850.190208    15036161.222     1.4
1,4 nonbonded interactions       4704.009408      423360.847     0.0
Calc Weights                    58819.617639     2117506.235     0.2
Spread Q Bspline              1254818.509632     2509637.019     0.2
Gather F Bspline              1254818.509632     7528911.058     0.7
3D-FFT                        1382948.519652    11063588.157     1.0
Solve PME                        1055.989952       67583.357     0.0
Shift-X                           490.201713        2941.210     0.0
Angles                           3264.006528      548353.097     0.1
Propers                           377.500755       86447.673     0.0
RB-Dihedrals                     3720.507441      918965.338     0.1
Virial                           3925.839258       70665.107     0.0
Stop-CM                           196.143426        1961.434     0.0
P-Coupling                      19606.539213      117639.235     0.0
Calc-Ekin                        7842.678426      211752.318     0.0
Lincs                            1818.003636      109080.218     0.0
Lincs-Mat                       39168.078336      156672.313     0.0
Constraint-V                    21442.542885      171540.343     0.0
Constraint-Vir                   3924.939249       94198.542     0.0
Settle                           5935.511871     1917170.334     0.2
Virtual Site 2                      2.400008          55.200     0.0
-----------------------------------------------------------------------------
Total                                         1078064243.645   100.0
-----------------------------------------------------------------------------
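Grouping the flop counts shows the mismatch behind the load-imbalance note. The non-bonded kernels, which run on the GPU here, account for over 97% of the arithmetic. The PME mesh work done on the CPU is only about 2% of the flops, yet per the cycle accounting below it takes most of the CPU wall time. A small sketch that sums the relevant rows (the grouping into "non-bonded" and "PME mesh" rows is ours):

```python
# M-Flops per row, copied from the MEGA-FLOPS ACCOUNTING table above.
MFLOPS = {
    "NxN QSTab Elec. + VdW [F]": 1034145860.950,
    "NxN QSTab Elec. + VdW [V&F]": 15036161.222,
    "Calc Weights": 2117506.235,
    "Spread Q Bspline": 2509637.019,
    "Gather F Bspline": 7528911.058,
    "3D-FFT": 11063588.157,
    "Solve PME": 67583.357,
}
TOTAL_MFLOPS = 1078064243.645

NONBONDED = ["NxN QSTab Elec. + VdW [F]", "NxN QSTab Elec. + VdW [V&F]"]
PME_MESH = ["Calc Weights", "Spread Q Bspline", "Gather F Bspline",
            "3D-FFT", "Solve PME"]


def share(rows):
    """Percentage of the total flop count contributed by the given rows."""
    return 100.0 * sum(MFLOPS[r] for r in rows) / TOTAL_MFLOPS


nonbonded_pct = share(NONBONDED)
pme_pct = share(PME_MESH)
print(f"non-bonded (GPU): {nonbonded_pct:.1f}% of flops")
print(f"PME mesh   (CPU): {pme_pct:.1f}% of flops")
```

Flop counts are not a proxy for time here. The GPU processes the non-bonded work far more efficiently than a single CPU core processes PME spread/gather.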
R E A L C Y C L E A N D T I M E A C C O U N T I N G
Computing:           Nodes Th.     Count  Wall t (s)    G-Cycles      %
-----------------------------------------------------------------------------
Vsite constr.            1   1    500001       4.359      12.641    0.0
Neighbor search          1   1     12501     629.074    1824.467    2.7
Launch GPU ops.          1   1    500001     100.561     291.650    0.4
Force                    1   1    500001    2266.241    6572.645    9.7
PME mesh                 1   1    500001   15297.280   44365.799   65.8
Wait GPU local           1   1    500001     164.817     478.010    0.7
NB X/F buffer ops.       1   1    987501     798.856    2316.876    3.4
Vsite spread             1   1    600002       5.753      16.684    0.0
Write traj.              1   1      1023       6.518      18.904    0.0
Update                   1   1    500001     557.942    1618.167    2.4
Constraints              1   1    500001    2758.223    7999.512   11.9
Rest                     1                   665.131    1929.040    2.9
-----------------------------------------------------------------------------
Total                    1                 23254.756   67444.397  100.0
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
PME spread/gather        1   1   1000002   13321.576   38635.780   57.3
PME 3D-FFT               1   1   1000002    1358.828    3940.929    5.8
PME solve                1   1    500001     611.302    1772.923    2.6
-----------------------------------------------------------------------------
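A few cross-checks on this table, as a sketch:

- Each percentage is the row's wall time over the total.
- Dividing G-cycles by wall time gives the clock rate the cycle counters imply, about 2.9 GHz.
- The Neighbor search count matches one pair search every nstlist = 40 steps, plus the initial one.

The PME mesh row alone is about two thirds of the run, all of it on a single CPU thread ("Using 1 OpenMP thread" above). This is consistent with the GPU/CPU load-imbalance note further down:

```python
# Rows copied from the REAL CYCLE AND TIME ACCOUNTING table:
# name -> (count, wall time in s, G-cycles)
ROWS = {
    "Neighbor search": (12501, 629.074, 1824.467),
    "Force": (500001, 2266.241, 6572.645),
    "PME mesh": (500001, 15297.280, 44365.799),
    "Constraints": (500001, 2758.223, 7999.512),
}
TOTAL_WALL_S = 23254.756
TOTAL_GCYCLES = 67444.397
NSTEPS, NSTLIST = 500000, 40  # nsteps and the (auto-increased) nstlist

# Each row's percentage is its share of the total wall time.
for name, (_, wall, _) in ROWS.items():
    print(f"{name:16s} {100 * wall / TOTAL_WALL_S:5.1f} %")

# Clock rate implied by the cycle counter, in GHz.
clock_ghz = TOTAL_GCYCLES / TOTAL_WALL_S
print(f"implied clock: {clock_ghz:.2f} GHz")

# One pair search every nstlist steps, plus the one at step 0.
expected_searches = NSTEPS // NSTLIST + 1
print("neighbor searches:", expected_searches)
```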
GPU timings
-----------------------------------------------------------------------------
Computing:                   Count  Wall t (s)   ms/step      %
-----------------------------------------------------------------------------
Pair list H2D                12501      13.744     1.099    0.2
X / q H2D                   500001     167.017     0.334    2.4
Nonbonded F kernel          400000    4858.062    12.145   69.5
Nonbonded F+ene k.           87500    1360.304    15.546   19.5
Nonbonded F+ene+prune k.     12501     200.021    16.000    2.9
F D2H                       500001     389.379     0.779    5.6
-----------------------------------------------------------------------------
Total                                 6988.527    13.977  100.0
-----------------------------------------------------------------------------
Force evaluation time GPU/CPU: 13.977 ms/35.127 ms = 0.398
For optimal performance this ratio should be close to 1!
NOTE: The GPU has >25% less load than the CPU. This imbalance causes
performance loss.
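The two per-step times in this ratio can be traced back to the tables above. In this log, the GPU figure is the GPU-timings total divided by the step count. The CPU figure matches the Force plus PME mesh wall times divided by the step count. That decomposition is our reading of the numbers, not something the log states. The sketch also applies the threshold implied by the note's wording (">25% less load", i.e. a ratio below 0.75), which is likewise an assumption:

```python
STEPS = 500001              # steps counted in the accounting tables
GPU_TOTAL_S = 6988.527      # "Total" row of the GPU timings table
CPU_FORCE_S = 2266.241      # "Force" row of the cycle accounting
CPU_PME_MESH_S = 15297.280  # "PME mesh" row of the cycle accounting

gpu_ms = 1000 * GPU_TOTAL_S / STEPS
cpu_ms = 1000 * (CPU_FORCE_S + CPU_PME_MESH_S) / STEPS
ratio = gpu_ms / cpu_ms

print(f"Force evaluation time GPU/CPU: {gpu_ms:.3f} ms/{cpu_ms:.3f} ms = {ratio:.3f}")

# Assumed threshold, read off the note's ">25% less load" wording.
if ratio < 0.75:
    print("GPU under-loaded: CPU-side work (mostly PME mesh) is the bottleneck")
```

Since PME mesh dominates the CPU side, anything that shrinks it or spreads it over more CPU cores would push this ratio toward 1.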
               Core t (s)   Wall t (s)        (%)
       Time:    23228.620    23254.756       99.9
                               6h27:34
                 (ns/day)    (hour/ns)
Performance:        3.715        6.460
Finished mdrun on node 0 Tue Apr 8 16:24:42 2014
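The performance line is plain arithmetic on the run length and the wall time. 500000 steps of 0.002 ps is 1 ns of simulated time, obtained in 23254.756 s of wall time. A quick check:

```python
NSTEPS = 500000     # nsteps from the input parameters
DT_PS = 0.002       # delta-t in ps
WALL_S = 23254.756  # "Wall t (s)" from the Time line

simulated_ns = NSTEPS * DT_PS / 1000.0
ns_per_day = simulated_ns / (WALL_S / 86400.0)
hours_per_ns = (WALL_S / 3600.0) / simulated_ns

# Wall time in the h:mm:ss style used by the log.
hours, rem = divmod(int(WALL_S), 3600)
minutes, seconds = divmod(rem, 60)

print(f"{ns_per_day:.3f} ns/day, {hours_per_ns:.3f} hour/ns")
print(f"{hours}h{minutes:02d}:{seconds:02d}")
```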
On 07/04/2014 17:32, Szilárd Páll <pall.szilard at gmail.com> wrote:
> Please post a log file; that would help me give you more concrete
> advice. My guess is that you're running reaction-field (otherwise you
> must have turned off PP-PME load balancing), but I'll comment more when
> I see a log file.
>
> Cheers,
> --
> Szilárd
>
>
> On Mon, Apr 7, 2014 at 4:33 PM, Dario Corrada <dario.corrada at gmail.com> wrote:
>> I have a machine with a dual-core AMD Athlon 64 and an NVIDIA GeForce
>> GTX 480.
>>
>> To optimize performance, I'd like to offload as much of the calculation
>> as possible to the GPU.
>>
>> I tried mdrun -nt 1 -nb gpu ..., but I got the following message:
>>
>> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>> performance loss.
>>
>> How can I improve mdrun performance?
>>
>> --
>> Dario CORRADA, PhD
>> Bioinformatics and Computational Chemistry specialist
>>
>> URL......: http://it.linkedin.com/in/dariocorrada/
>> mail.....: dario.corrada at gmail.com
>> skype....: dario.corrada
>> tel......: +39 333 5347024
>> address..: via Benvenuto Cellini, 4 - 20900 Monza IT
>>
>> "When you have eliminated the impossible, whatever remains, however
>> improbable, must be the truth."
>> [A.C. Doyle]