[gmx-users] GPU low performance
Carmen Di Giovanni
cdigiova at unina.it
Wed Feb 18 17:20:52 CET 2015
Justin,
the problem is evident in all of my calculations.
This is the log file of a recent run:
--------------------------------------------------------------------------------
Log file opened on Mon Dec 22 16:28:00 2014
Host: localhost.localdomain pid: 8378 rank ID: 0 number of ranks: 1
GROMACS: gmx mdrun, VERSION 5.0
GROMACS is written by:
Emile Apol Rossen Apostolov Herman J.C. Berendsen Par Bjelkmar
Aldert van Buuren Rudi van Drunen Anton Feenstra Sebastian Fritsch
Gerrit Groenhof Christoph Junghans Peter Kasson Carsten Kutzner
Per Larsson Justin A. Lemkul Magnus Lundborg Pieter Meulenhoff
Erik Marklund Teemu Murtola Szilard Pall Sander Pronk
Roland Schulz Alexey Shvetsov Michael Shirts Alfons Sijbers
Peter Tieleman Christian Wennberg Maarten Wolf
and the project leaders:
Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2014, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.
GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.
GROMACS: gmx mdrun, VERSION 5.0
Executable: /opt/SW/gromacs-5.0/build/mpi-cuda/bin/gmx_mpi
Library dir: /opt/SW/gromacs-5.0/share/top
Command line:
gmx_mpi mdrun -deffnm prod_20ns
Gromacs version: VERSION 5.0
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled
GPU support: enabled
invsqrt routine: gmx_software_invsqrt(x)
SIMD instructions: AVX_256
FFT library: fftw-3.3.3-sse2
RDTSCP usage: enabled
C++11 compilation: disabled
TNG support: enabled
Tracing support: disabled
Built on: Thu Jul 31 18:30:37 CEST 2014
Built by: root at localhost.localdomain [CMAKE]
Build OS/arch: Linux 2.6.32-431.el6.x86_64 x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Build CPU family: 6 Model: 62 Stepping: 4
Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm
mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /usr/bin/cc GNU 4.4.7
C compiler flags: -mavx -Wno-maybe-uninitialized -Wextra
-Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith
-Wall -Wno-unused -Wunused-value -Wunused-parameter
-fomit-frame-pointer -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 4.4.7
C++ compiler flags: -mavx -Wextra -Wno-missing-field-initializers
-Wpointer-arith -Wall -Wno-unused-function -fomit-frame-pointer
-funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
Boost version: 1.55.0 (internal)
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda
compiler driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on
Thu_Mar_13_11:58:58_PDT_2014;Cuda compilation tools, release 6.0, V6.0.1
CUDA compiler flags: -gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;-Xcompiler;-fPIC;
-mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-fomit-frame-pointer;-funroll-all-loops;-Wno-array-bounds;-O3;-DNDEBUG
CUDA driver: 6.50
CUDA runtime: 6.0
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------
For optimal performance with a GPU nstlist (now 10) should be larger.
The optimum depends on your CPU and GPU resources.
You might want to try several nstlist values.
Changing nstlist from 10 to 40, rlist from 1.2 to 1.285
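(Aside: the nstlist value mdrun picks here can also be set by hand; a minimal sketch with hypothetical values, assuming the Verlet cut-off scheme used in this run:

    ; .mdp file -- with the Verlet scheme this acts as a minimum that mdrun may raise
    nstlist = 40

    # or overridden when launching the run
    gmx_mpi mdrun -deffnm prod_20ns -nstlist 40

As the note above says, it is worth timing a few values on this particular CPU/GPU combination.)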
Input Parameters:
integrator = md
tinit = 0
dt = 0.002
nsteps = 10000000
init-step = 0
simulation-part = 1
comm-mode = Linear
nstcomm = 1
bd-fric = 0
ld-seed = 1993
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 2500
nstvout = 2500
nstfout = 0
nstlog = 2500
nstcalcenergy = 1
nstenergy = 2500
nstxout-compressed = 500
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 40
ns-type = Grid
pbc = xyz
periodic-molecules = FALSE
verlet-buffer-tolerance = 0.005
rlist = 1.285
rlistlong = 1.285
nstcalclr = 10
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 1.2
epsilon-r = 1
epsilon-rf = 1
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 1.2
DispCorr = No
table-extension = 1
fourierspacing = 0.135
fourier-nx = 128
fourier-ny = 128
fourier-nz = 128
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
implicit-solvent = No
gb-algorithm = Still
nstgbradii = 1
rgbradii = 2
gb-epsilon-solvent = 80
gb-saltconc = 0
gb-obc-alpha = 1
gb-obc-beta = 0.8
gb-obc-gamma = 4.85
gb-dielectric-offset = 0.009
sa-algorithm = Ace-approximation
sa-surface-tension = 2.092
tcoupl = V-rescale
nsttcouple = 10
nh-chain-length = 0
print-nose-hoover-chain-variables = FALSE
pcoupl = No
pcoupltype = Semiisotropic
nstpcouple = -1
tau-p = 0.5
compressibility (3x3):
compressibility[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p (3x3):
ref-p[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
refcoord-scaling = No
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = FALSE
QMconstraints = 0
QMMMscheme = 0
MMChargeScaleFactor = 1
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = FALSE
Shake-SOR = FALSE
shake-tol = 0.0001
lincs-order = 4
lincs-iter = 1
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = no
rotation = FALSE
interactiveMD = FALSE
disre = No
disre-weighting = Conservative
disre-mixed = FALSE
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = FALSE
E-x:
n = 0
E-xt:
n = 0
E-y:
n = 0
E-yt:
n = 0
E-z:
n = 0
E-zt:
n = 0
swapcoords = no
adress = FALSE
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
grpopts:
nrdf: 869226
ref-t: 300
tau-t: 0.1
annealing: No
annealing-npoints: 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0
Using 1 MPI process
Using 32 OpenMP threads
Detecting CPU SIMD instructions.
Present hardware specification:
Vendor: GenuineIntel
Brand: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Family: 6 Model: 62 Stepping: 4
Features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr
nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2
sse3 sse4.1 sse4.2 ssse3 tdt x2apic
SIMD instructions most likely to fit this hardware: AVX_256
SIMD instructions selected at GROMACS compile time: AVX_256
2 GPUs detected on host localhost.localdomain:
#0: NVIDIA Tesla K20c, compute cap.: 3.5, ECC: yes, stat: compatible
#1: NVIDIA GeForce GTX 650, compute cap.: 3.0, ECC: no, stat: compatible
1 GPU auto-selected for this run.
Mapping of GPU to the 1 PP rank in this node: #0
NOTE: potentially sub-optimal launch configuration, gmx_mpi started with less
PP MPI process per node than GPUs available.
Each PP MPI process can use only one GPU, 1 GPU per node will be used.
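(One way to act on this note is to start one PP MPI rank per detected GPU; a hedged sketch for this run, assuming the 32 OpenMP threads are split evenly between the two ranks:

    mpirun -np 2 gmx_mpi mdrun -deffnm prod_20ns -gpu_id 01 -ntomp 16

Whether this pays off is not guaranteed: the GeForce GTX 650 is far slower than the Tesla K20c and can hold the faster card back, so a single-rank run pinned to the K20c alone, e.g. with -gpu_id 0, is the other configuration worth comparing.)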
Will do PME sum in reciprocal space for electrostatic interactions.
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------
Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Cut-off's: NS: 1.285 Coulomb: 1.2 LJ: 1.2
System total charge: -0.012
Generated table with 1142 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1142 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1142 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 1142 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1142 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1142 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Using CUDA 8x8 non-bonded kernels
Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 7.82e-04 size: 1536
Removing pbc first time
Pinning threads with an auto-selected logical core stride of 1
Initializing LINear Constraint Solver
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- --- Thank You --- -------- --------
The number of constraints is 5913
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
0: rest
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
G. Bussi, D. Donadio and M. Parrinello
Canonical sampling through velocity rescaling
J. Chem. Phys. 126 (2007) pp. 014101
-------- -------- --- Thank You --- -------- --------
There are: 434658 Atoms
Constraining the starting coordinates (step 0)
Constraining the coordinates at t0-dt (step 0)
RMS relative constraint deviation after constraining: 3.67e-05
Initial temperature: 300.5 K
Started mdrun on rank 0 Mon Dec 22 16:28:01 2014
Step Time Lambda
0 0.00000 0.00000
Energies (kJ/mol)
G96Angle Proper Dih. Improper Dih. LJ-14 Coulomb-14
9.74139e+03 4.34956e+03 2.97359e+03 -1.93107e+02 8.05534e+04
LJ (SR) Coulomb (SR) Coul. recip. Potential Kinetic En.
1.01340e+06 -7.13271e+06 2.01361e+04 -6.00175e+06 1.09887e+06
Total Energy Conserved En. Temperature Pressure (bar) Constr. rmsd
-4.90288e+06 -4.90288e+06 3.04092e+02 1.70897e+02 2.16683e-05
step 80: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6279.0 M-cycles
step 160: timed with pme grid 112 112 112, coulomb cutoff 1.306: 6962.2 M-cycles
step 240: timed with pme grid 100 100 100, coulomb cutoff 1.463: 8406.5 M-cycles
step 320: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6424.0 M-cycles
step 400: timed with pme grid 120 120 120, coulomb cutoff 1.219: 6369.1 M-cycles
step 480: timed with pme grid 112 112 112, coulomb cutoff 1.306: 7309.0 M-cycles
step 560: timed with pme grid 108 108 108, coulomb cutoff 1.355: 7521.2 M-cycles
step 640: timed with pme grid 104 104 104, coulomb cutoff 1.407: 8369.8 M-cycles
optimal pme grid 128 128 128, coulomb cutoff 1.200
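(The tuner above only makes the Coulomb cut-off longer and the PME grid coarser than the .mdp values, shifting work onto the GPU. To shift load the other way, towards the CPU PME part, the cut-offs and grid spacing would have to be reduced in the .mdp itself; a minimal sketch with hypothetical values, and only admissible if the force field parameterisation tolerates shorter cut-offs:

    rcoulomb       = 1.0
    rvdw           = 1.0
    fourierspacing = 0.12

Any such change alters the model physics as well as the load balance, so the run should be validated and re-benchmarked afterwards.)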
Step Time Lambda
2500 5.00000 0.00000
Energies (kJ/mol)
G96Angle Proper Dih. Improper Dih. LJ-14 Coulomb-14
9.72545e+03 4.33046e+03 2.98087e+03 -1.95794e+02 8.05967e+04
LJ (SR) Coulomb (SR) Coul. recip. Potential Kinetic En.
1.01293e+06 -7.13110e+06 2.01689e+04 -6.00057e+06 1.08489e+06
Total Energy Conserved En. Temperature Pressure (bar) Constr. rmsd
-4.91567e+06 -4.90300e+06 3.00225e+02 1.36173e+02 2.25998e-05
Step Time Lambda
5000 10.00000 0.00000
............
-------------------------------------------------------------------------------
Thank you in advance
--
Carmen Di Giovanni, PhD
Dept. of Pharmaceutical and Toxicological Chemistry
"Drug Discovery Lab"
University of Naples "Federico II"
Via D. Montesano, 49
80131 Naples
Tel.: ++39 081 678623
Fax: ++39 081 678100
Email: cdigiova at unina.it
Quoting Justin Lemkul <jalemkul at vt.edu>:
>
>
> On 2/18/15 11:09 AM, Barnett, James W wrote:
>> What's your exact command?
>>
>
> A full .log file would be even better; it would tell us everything
> we need to know :)
>
> -Justin
>
>> Have you reviewed this page:
>> http://www.gromacs.org/Documentation/Acceleration_and_parallelization
>>
>> James "Wes" Barnett
>> Ph.D. Candidate
>> Chemical and Biomolecular Engineering
>>
>> Tulane University
>> Boggs Center for Energy and Biotechnology, Room 341-B
>>
>> ________________________________________
>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se
>> <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of
>> Carmen Di Giovanni <cdigiova at unina.it>
>> Sent: Wednesday, February 18, 2015 10:06 AM
>> To: gromacs.org_gmx-users at maillist.sys.kth.se
>> Subject: Re: [gmx-users] GPU low performance
>>
>> I am posting the output message of an MD run:
>>
>>
>> Force evaluation time GPU/CPU: 40.974 ms/24.437 ms = 1.677
>> For optimal performance this ratio should be close to 1!
>>
>>
>> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>> performance loss, consider using a shorter cut-off and a
>> finer PME grid.
>>
>> How can I solve this problem?
>> Thank you in advance
>>
>>
>> --
>> Carmen Di Giovanni, PhD
>> Dept. of Pharmaceutical and Toxicological Chemistry
>> "Drug Discovery Lab"
>> University of Naples "Federico II"
>> Via D. Montesano, 49
>> 80131 Naples
>> Tel.: ++39 081 678623
>> Fax: ++39 081 678100
>> Email: cdigiova at unina.it
>>
>>
>>
>> Quoting Justin Lemkul <jalemkul at vt.edu>:
>>
>>>
>>>
>>> On 2/18/15 10:30 AM, Carmen Di Giovanni wrote:
>>>> Dear all,
>>>> I'm working on a machine with an NVIDIA Tesla K20.
>>>> After a minimization on a protein of 1925 atoms this is the message:
>>>>
>>>> Force evaluation time GPU/CPU: 2.923 ms/116.774 ms = 0.025
>>>> For optimal performance this ratio should be close to 1!
>>>>
>>>
>>> Minimization is a poor indicator of performance. Do a real MD run.
>>>
>>>>
>>>> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>>>> performance loss.
>>>>
>>>> Core t (s) Wall t (s) (%)
>>>> Time: 3289.010 205.891 1597.4
>>>> (steps/hour)
>>>> Performance: 8480.2
>>>> Finished mdrun on rank 0 Wed Feb 18 15:50:06 2015
>>>>
>>>>
>>>> Can I improve the performance?
>>>> At the moment I have not found enough information in the forum to solve
>>>> this problem.
>>>> The .log file is attached.
>>>>
>>>
>>> The list does not accept attachments. If you wish to share a file,
>>> upload it to a file-sharing service and provide a URL. The full
>>> .log is quite important for understanding your hardware,
>>> optimizations, and seeing full details of the performance breakdown.
>>> But again, base your assessment on MD, not EM.
>>>
>>> -Justin
>>>
>>> --
>>> ==================================================
>>>
>>> Justin A. Lemkul, Ph.D.
>>> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>>>
>>> Department of Pharmaceutical Sciences
>>> School of Pharmacy
>>> Health Sciences Facility II, Room 629
>>> University of Maryland, Baltimore
>>> 20 Penn St.
>>> Baltimore, MD 21201
>>>
>>> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
>>> http://mackerell.umaryland.edu/~jalemkul
>>>
>>> ==================================================
>>
>>
>>
>
> --
> ==================================================
>
> Justin A. Lemkul, Ph.D.
> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>
> Department of Pharmaceutical Sciences
> School of Pharmacy
> Health Sciences Facility II, Room 629
> University of Maryland, Baltimore
> 20 Penn St.
> Baltimore, MD 21201
>
> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
> http://mackerell.umaryland.edu/~jalemkul
>
> ==================================================