[gmx-users] GPU low performance
Szilárd Páll
pall.szilard at gmail.com
Wed Feb 18 18:18:09 CET 2015
I've just noticed something serious. Why are you calculating energies
every step? Doing that makes the non-bonded force calculation on
average 25-30% slower than e.g. calculating energies only every 100th
step.
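
That's the nstcalcenergy setting in the .mdp; the posted log shows
nstcalcenergy = 1. A minimal sketch of the change, assuming the rest of
the input stays as posted:

    nstcalcenergy  = 100    ; compute energies every 100 steps instead of every step
    nstenergy      = 2500   ; unchanged, still a multiple of nstcalcenergy

Re-running grompp to regenerate the .tpr is needed for this to take effect.
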
You may be able to get another 5% or so from your GPU; could you post
the output of "nvidia-smi -q -g 0"?
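
The parts of that output usually worth checking are the Performance
State, the current Clocks versus the Applications/Max Clocks, and the
ECC mode. As a rough sketch of how those are inspected and, where the
board and driver allow it, adjusted (the clock values are placeholders
to take from the "Supported Clocks" section; both changes need root,
and the ECC change only applies after a reboot):

    nvidia-smi -q -g 0                       # full query, as above
    nvidia-smi -i 0 -ac <memMHz>,<gfxMHz>    # raise application clocks, if the board exposes them as settable
    nvidia-smi -i 0 -e 0                     # disable ECC; can buy a few percent of throughput

Treat the -ac line as optional; not every Kepler board allows changing
application clocks.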
--
Szilárd
On Wed, Feb 18, 2015 at 6:14 PM, Szilárd Páll <pall.szilard at gmail.com> wrote:
> On Wed, Feb 18, 2015 at 5:57 PM, Carmen Di Giovanni <cdigiova at unina.it> wrote:
>> Dear all, the full log file is too big.
>
> Use pastebin or similar services.
>
>> However, the middle part of it contains only information about the
>> energies at each step. The first part was already posted.
>
> OK, so first of all, this looks nothing like the alarmingly low
> CPU-GPU overlap you posted about initially. Here, the GPU you are
> using simply can't keep up with 2x8 Ivy Bridge-EP cores (E5-2650 v2).
> You can see this in the fraction of the runtime the CPU spends waiting
> for the GPU, shown in the performance table's "Wait GPU local" row,
> which accounts for 28.7% of the run spent idling.
>
> At the moment the non-bonded computation, which is done entirely on the
> GPU, can't be split between CPU and GPU, so your options are limited
> and most of them will have only a minor effect:
> i) indirectly shift work back to the CPU and/or improve the overlap efficiency (example commands below):
> a) try decreasing nstlist to 10-25
> b) run on fewer threads (as suggested before), which will likely
> improve performance in some non-overlapping code parts
> c) run with DD, e.g. -ntmpi 4 -ntomp 4/8 -gpu_id 0011 or -ntmpi 8
> -gpu_id 00001111
>
> ii) Reduce the "Rest" time. Not sure what's causing it, but your
> simulation spends a substantial fraction (15.6%) of the runtime in
> unaccounted-for, likely serial, computation; i-b and i-c will likely
> reduce this somewhat too;
>
> iii) get more and/or faster GPUs.
>
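> As a concrete sketch of i-a/b/c for this setup (the build uses real
> MPI, so DD ranks come from mpirun rather than -ntmpi; the rank/thread
> counts and GPU id strings below are only examples):
>
>   # i-a, tried at runtime without regenerating the tpr:
>   gmx_mpi mdrun -deffnm prod_20ns -nstlist 20
>
>   # i-b: one rank, 16 threads pinned to the 16 physical cores:
>   gmx_mpi mdrun -deffnm prod_20ns -ntomp 16 -pin on
>
>   # i-c: 4 PP ranks with DD, 8 threads each, two ranks per GPU
>   # (GPU 0 = K20c, GPU 1 = GTX 650):
>   mpirun -np 4 gmx_mpi mdrun -deffnm prod_20ns -ntomp 8 -gpu_id 0011 -pin on
>
>   # or keep all ranks on the K20c if the GTX 650 slows things down:
>   mpirun -np 4 gmx_mpi mdrun -deffnm prod_20ns -ntomp 8 -gpu_id 0000 -pin on
>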
>> So I post the final part of it:
>> -------------------------------------------------------------
>> Step Time Lambda
>> 10000000 20000.00000 0.00000
>>
>> Writing checkpoint, step 10000000 at Mon Dec 29 13:16:22 2014
>>
>>
>> Energies (kJ/mol)
>> G96Angle Proper Dih. Improper Dih. LJ-14 Coulomb-14
>> 9.34206e+03 4.14342e+03 2.79172e+03 -1.75465e+02 7.99811e+04
>> LJ (SR) Coulomb (SR) Coul. recip. Potential Kinetic En.
>> 1.01135e+06 -7.13064e+06 2.01349e+04 -6.00306e+06 1.08201e+06
>> Total Energy Conserved En. Temperature Pressure (bar) Constr. rmsd
>> -4.92106e+06 -5.86747e+06 2.99426e+02 1.29480e+02 2.16280e-05
>>
>> <====== ############### ==>
>> <==== A V E R A G E S ====>
>> <== ############### ======>
>>
>> Statistics over 10000001 steps using 10000001 frames
>>
>> Energies (kJ/mol)
>> G96Angle Proper Dih. Improper Dih. LJ-14 Coulomb-14
>> 9.45818e+03 4.30665e+03 2.92407e+03 -1.75556e+02 8.02473e+04
>> LJ (SR) Coulomb (SR) Coul. recip. Potential Kinetic En.
>> 1.01284e+06 -7.13138e+06 2.01510e+04 -6.00163e+06 1.08407e+06
>> Total Energy Conserved En. Temperature Pressure (bar) Constr. rmsd
>> -4.91756e+06 -5.38519e+06 2.99998e+02 1.37549e+02 0.00000e+00
>>
>> Total Virial (kJ/mol)
>> 3.42887e+05 1.63625e+01 1.23658e+02
>> 1.67406e+01 3.42916e+05 -4.27834e+01
>> 1.23997e+02 -4.29636e+01 3.42881e+05
>>
>> Pressure (bar)
>> 1.37573e+02 7.50214e-02 -1.03916e-01
>> 7.22048e-02 1.37623e+02 -1.66417e-02
>> -1.06444e-01 -1.52990e-02 1.37453e+02
>>
>>
>> M E G A - F L O P S A C C O U N T I N G
>>
>> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
>> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
>> W3=SPC/TIP3p W4=TIP4p (single or pairs)
>> V&F=Potential and force V=Potential only F=Force only
>>
>> Computing: M-Number M-Flops % Flops
>> -----------------------------------------------------------------------------
>> Pair Search distance check 16343508.605344 147091577.448 0.0
>> NxN Ewald Elec. + LJ [V&F] 5072118956.506304 542716728346.174 98.1
>> 1,4 nonbonded interactions 95860.009586 8627400.863 0.0
>> Calc Weights 13039741.303974 469430686.943 0.1
>> Spread Q Bspline 278181147.818112 556362295.636 0.1
>> Gather F Bspline 278181147.818112 1669086886.909 0.3
>> 3D-FFT 880787450.909824 7046299607.279 1.3
>> Solve PME 163837.909504 10485626.208 0.0
>> Shift-X 108664.934658 651989.608 0.0
>> Angles 86090.008609 14463121.446 0.0
>> Propers 31380.003138 7186020.719 0.0
>> Impropers 28790.002879 5988320.599 0.0
>> Virial 4347030.434703 78246547.825 0.0
>> Stop-CM 4346580.869316 43465808.693 0.0
>> Calc-Ekin 4346580.869316 117357683.472 0.0
>> Lincs 59130.017739 3547801.064 0.0
>> Lincs-Mat 1033080.309924 4132321.240 0.0
>> Constraint-V 4406580.881316 35252647.051 0.0
>> Constraint-Vir 4347450.434745 104338810.434 0.0
>> Settle 1429440.428832 461709258.513 0.1
>> -----------------------------------------------------------------------------
>> Total 553500452758.122 100.0
>> -----------------------------------------------------------------------------
>>
>>
>> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>>
>> On 1 MPI rank, each using 32 OpenMP threads
>>
>> Computing: Num Num Call Wall time Giga-Cycles
>> Ranks Threads Count (s) total sum %
>> -----------------------------------------------------------------------------
>> Neighbor search 1 32 250001 6231.657 518475.694 1.1
>> Launch GPU ops. 1 32 10000001 1825.689 151897.833 0.3
>> Force 1 32 10000001 49568.959 4124152.027 8.4
>> PME mesh 1 32 10000001 194798.850 16207321.863 32.8
>> Wait GPU local 1 32 10000001 170272.438 14166717.115 28.7
>> NB X/F buffer ops. 1 32 19750001 29175.632 2427421.177 4.9
>> Write traj. 1 32 20635 1567.928 130452.056 0.3
>> Update 1 32 10000001 13312.819 1107630.452 2.2
>> Constraints 1 32 10000001 34210.142 2846293.908 5.8
>> Rest 92338.781 7682613.897 15.6
>> -----------------------------------------------------------------------------
>> Total 593302.894 49362976.023 100.0
>> -----------------------------------------------------------------------------
>> Breakdown of PME mesh computation
>> -----------------------------------------------------------------------------
>> PME spread/gather 1 32 20000002 144767.207 12044674.424 24.4
>> PME 3D-FFT 1 32 20000002 39499.157 3286341.501 6.7
>> PME solve Elec 1 32 10000001 9947.340 827621.589 1.7
>> -----------------------------------------------------------------------------
>>
>> GPU timings
>> -----------------------------------------------------------------------------
>> Computing: Count Wall t (s) ms/step %
>> -----------------------------------------------------------------------------
>> Pair list H2D 250001 935.751 3.743 0.2
>> X / q H2D 10000001 11509.209 1.151 2.8
>> Nonbonded F+ene k. 9750000 377111.949 38.678 92.0
>> Nonbonded F+ene+prune k. 250001 12049.010 48.196 2.9
>> F D2H 10000001 8129.292 0.813 2.0
>> -----------------------------------------------------------------------------
>> Total 409735.211 40.974 100.0
>> -----------------------------------------------------------------------------
>>
>> Force evaluation time GPU/CPU: 40.974 ms/24.437 ms = 1.677
>> For optimal performance this ratio should be close to 1!
>>
>>
>> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>> performance loss, consider using a shorter cut-off and a finer PME
>> grid.
>>
>> Core t (s) Wall t (s) (%)
>> Time: 18713831.228 593302.894 3154.2
>> 6d20h48:22
>> (ns/day) (hour/ns)
>> Performance: 2.913 8.240
>> Finished mdrun on rank 0 Mon Dec 29 13:16:24 2014
>>
>>
>> -------------------------------------------------------
>> thank you in advance
>> Carmen
>>
>>
>>
>> --
>> Carmen Di Giovanni, PhD
>> Dept. of Pharmaceutical and Toxicological Chemistry
>> "Drug Discovery Lab"
>> University of Naples "Federico II"
>> Via D. Montesano, 49
>> 80131 Naples
>> Tel.: ++39 081 678623
>> Fax: ++39 081 678100
>> Email: cdigiova at unina.it
>>
>>
>>
>> Quoting Szilárd Páll <pall.szilard at gmail.com>:
>>
>>> We need a *full* log file, not parts of it!
>>>
>>> You can try running with "-ntomp 16 -pin on" - it may be a bit faster
>>> to not use HyperThreading.
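>>>
>>> With the posted run that would look something like this (binary and
>>> -deffnm taken from your log; 16 matches the number of physical cores):
>>>
>>>   gmx_mpi mdrun -deffnm prod_20ns -ntomp 16 -pin on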
>>> --
>>> Szilárd
>>>
>>>
>>> On Wed, Feb 18, 2015 at 5:20 PM, Carmen Di Giovanni <cdigiova at unina.it>
>>> wrote:
>>>>
>>>> Justin,
>>>> the problem is evident for all calculations.
>>>> This is the log file of a recent run:
>>>>
>>>>
>>>> --------------------------------------------------------------------------------
>>>>
>>>> Log file opened on Mon Dec 22 16:28:00 2014
>>>> Host: localhost.localdomain pid: 8378 rank ID: 0 number of ranks: 1
>>>> GROMACS: gmx mdrun, VERSION 5.0
>>>>
>>>> GROMACS is written by:
>>>> Emile Apol Rossen Apostolov Herman J.C. Berendsen Par Bjelkmar
>>>> Aldert van Buuren Rudi van Drunen Anton Feenstra Sebastian
>>>> Fritsch
>>>> Gerrit Groenhof Christoph Junghans Peter Kasson Carsten Kutzner
>>>> Per Larsson Justin A. Lemkul Magnus Lundborg Pieter
>>>> Meulenhoff
>>>> Erik Marklund Teemu Murtola Szilard Pall Sander Pronk
>>>> Roland Schulz Alexey Shvetsov Michael Shirts Alfons Sijbers
>>>> Peter Tieleman Christian Wennberg Maarten Wolf
>>>> and the project leaders:
>>>> Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel
>>>>
>>>> Copyright (c) 1991-2000, University of Groningen, The Netherlands.
>>>> Copyright (c) 2001-2014, The GROMACS development team at
>>>> Uppsala University, Stockholm University and
>>>> the Royal Institute of Technology, Sweden.
>>>> check out http://www.gromacs.org for more information.
>>>>
>>>> GROMACS is free software; you can redistribute it and/or modify it
>>>> under the terms of the GNU Lesser General Public License
>>>> as published by the Free Software Foundation; either version 2.1
>>>> of the License, or (at your option) any later version.
>>>>
>>>> GROMACS: gmx mdrun, VERSION 5.0
>>>> Executable: /opt/SW/gromacs-5.0/build/mpi-cuda/bin/gmx_mpi
>>>> Library dir: /opt/SW/gromacs-5.0/share/top
>>>> Command line:
>>>> gmx_mpi mdrun -deffnm prod_20ns
>>>>
>>>> Gromacs version: VERSION 5.0
>>>> Precision: single
>>>> Memory model: 64 bit
>>>> MPI library: MPI
>>>> OpenMP support: enabled
>>>> GPU support: enabled
>>>> invsqrt routine: gmx_software_invsqrt(x)
>>>> SIMD instructions: AVX_256
>>>> FFT library: fftw-3.3.3-sse2
>>>> RDTSCP usage: enabled
>>>> C++11 compilation: disabled
>>>> TNG support: enabled
>>>> Tracing support: disabled
>>>> Built on: Thu Jul 31 18:30:37 CEST 2014
>>>> Built by: root at localhost.localdomain [CMAKE]
>>>> Build OS/arch: Linux 2.6.32-431.el6.x86_64 x86_64
>>>> Build CPU vendor: GenuineIntel
>>>> Build CPU brand: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>>>> Build CPU family: 6 Model: 62 Stepping: 4
>>>> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx
>>>> msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2
>>>> sse3
>>>> sse4.1 sse4.2 ssse3 tdt x2apic
>>>> C compiler: /usr/bin/cc GNU 4.4.7
>>>> C compiler flags: -mavx -Wno-maybe-uninitialized -Wextra
>>>> -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
>>>> -Wno-unused -Wunused-value -Wunused-parameter -fomit-frame-pointer
>>>> -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
>>>> C++ compiler: /usr/bin/c++ GNU 4.4.7
>>>> C++ compiler flags: -mavx -Wextra -Wno-missing-field-initializers
>>>> -Wpointer-arith -Wall -Wno-unused-function -fomit-frame-pointer
>>>> -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
>>>> Boost version: 1.55.0 (internal)
>>>> CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda
>>>> compiler
>>>> driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on
>>>> Thu_Mar_13_11:58:58_PDT_2014;Cuda compilation tools, release 6.0, V6.0.1
>>>> CUDA compiler
>>>>
>>>> flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;-Xcompiler;-fPIC
>>>> ;
>>>>
>>>> ;-mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-fomit-frame-pointer;-funroll-all-loops;-Wno-array-bounds;-O3;-DNDEBUG
>>>> CUDA driver: 6.50
>>>> CUDA runtime: 6.0
>>>>
>>>>
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
>>>> GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
>>>> molecular simulation
>>>> J. Chem. Theory Comput. 4 (2008) pp. 435-447
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J.
>>>> C.
>>>> Berendsen
>>>> GROMACS: Fast, Flexible and Free
>>>> J. Comp. Chem. 26 (2005) pp. 1701-1719
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> E. Lindahl and B. Hess and D. van der Spoel
>>>> GROMACS 3.0: A package for molecular simulation and trajectory analysis
>>>> J. Mol. Mod. 7 (2001) pp. 306-317
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> H. J. C. Berendsen, D. van der Spoel and R. van Drunen
>>>> GROMACS: A message-passing parallel molecular dynamics implementation
>>>> Comp. Phys. Comm. 91 (1995) pp. 43-56
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>>
>>>> For optimal performance with a GPU nstlist (now 10) should be larger.
>>>> The optimum depends on your CPU and GPU resources.
>>>> You might want to try several nstlist values.
>>>> Changing nstlist from 10 to 40, rlist from 1.2 to 1.285
>>>>
>>>> Input Parameters:
>>>> integrator = md
>>>> tinit = 0
>>>> dt = 0.002
>>>> nsteps = 10000000
>>>> init-step = 0
>>>> simulation-part = 1
>>>> comm-mode = Linear
>>>> nstcomm = 1
>>>> bd-fric = 0
>>>> ld-seed = 1993
>>>> emtol = 10
>>>> emstep = 0.01
>>>> niter = 20
>>>> fcstep = 0
>>>> nstcgsteep = 1000
>>>> nbfgscorr = 10
>>>> rtpi = 0.05
>>>> nstxout = 2500
>>>> nstvout = 2500
>>>> nstfout = 0
>>>> nstlog = 2500
>>>> nstcalcenergy = 1
>>>> nstenergy = 2500
>>>> nstxout-compressed = 500
>>>> compressed-x-precision = 1000
>>>> cutoff-scheme = Verlet
>>>> nstlist = 40
>>>> ns-type = Grid
>>>> pbc = xyz
>>>> periodic-molecules = FALSE
>>>> verlet-buffer-tolerance = 0.005
>>>> rlist = 1.285
>>>> rlistlong = 1.285
>>>> nstcalclr = 10
>>>> coulombtype = PME
>>>> coulomb-modifier = Potential-shift
>>>> rcoulomb-switch = 0
>>>> rcoulomb = 1.2
>>>> epsilon-r = 1
>>>> epsilon-rf = 1
>>>> vdw-type = Cut-off
>>>> vdw-modifier = Potential-shift
>>>> rvdw-switch = 0
>>>> rvdw = 1.2
>>>> DispCorr = No
>>>> table-extension = 1
>>>> fourierspacing = 0.135
>>>> fourier-nx = 128
>>>> fourier-ny = 128
>>>> fourier-nz = 128
>>>> pme-order = 4
>>>> ewald-rtol = 1e-05
>>>> ewald-rtol-lj = 0.001
>>>> lj-pme-comb-rule = Geometric
>>>> ewald-geometry = 0
>>>> epsilon-surface = 0
>>>> implicit-solvent = No
>>>> gb-algorithm = Still
>>>> nstgbradii = 1
>>>> rgbradii = 2
>>>> gb-epsilon-solvent = 80
>>>> gb-saltconc = 0
>>>> gb-obc-alpha = 1
>>>> gb-obc-beta = 0.8
>>>> gb-obc-gamma = 4.85
>>>> gb-dielectric-offset = 0.009
>>>> sa-algorithm = Ace-approximation
>>>> sa-surface-tension = 2.092
>>>> tcoupl = V-rescale
>>>> nsttcouple = 10
>>>> nh-chain-length = 0
>>>> print-nose-hoover-chain-variables = FALSE
>>>> pcoupl = No
>>>> pcoupltype = Semiisotropic
>>>> nstpcouple = -1
>>>> tau-p = 0.5
>>>> compressibility (3x3):
>>>> compressibility[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> compressibility[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> ref-p (3x3):
>>>> ref-p[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> ref-p[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> refcoord-scaling = No
>>>> posres-com (3):
>>>> posres-com[0]= 0.00000e+00
>>>> posres-com[1]= 0.00000e+00
>>>> posres-com[2]= 0.00000e+00
>>>> posres-comB (3):
>>>> posres-comB[0]= 0.00000e+00
>>>> posres-comB[1]= 0.00000e+00
>>>> posres-comB[2]= 0.00000e+00
>>>> QMMM = FALSE
>>>> QMconstraints = 0
>>>> QMMMscheme = 0
>>>> MMChargeScaleFactor = 1
>>>> qm-opts:
>>>> ngQM = 0
>>>> constraint-algorithm = Lincs
>>>> continuation = FALSE
>>>> Shake-SOR = FALSE
>>>> shake-tol = 0.0001
>>>> lincs-order = 4
>>>> lincs-iter = 1
>>>> lincs-warnangle = 30
>>>> nwall = 0
>>>> wall-type = 9-3
>>>> wall-r-linpot = -1
>>>> wall-atomtype[0] = -1
>>>> wall-atomtype[1] = -1
>>>> wall-density[0] = 0
>>>> wall-density[1] = 0
>>>> wall-ewald-zfac = 3
>>>> pull = no
>>>> rotation = FALSE
>>>> interactiveMD = FALSE
>>>> disre = No
>>>> disre-weighting = Conservative
>>>> disre-mixed = FALSE
>>>> dr-fc = 1000
>>>> dr-tau = 0
>>>> nstdisreout = 100
>>>> orire-fc = 0
>>>> orire-tau = 0
>>>> nstorireout = 100
>>>> free-energy = no
>>>> cos-acceleration = 0
>>>> deform (3x3):
>>>> deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> simulated-tempering = FALSE
>>>> E-x:
>>>> n = 0
>>>> E-xt:
>>>> n = 0
>>>> E-y:
>>>> n = 0
>>>> E-yt:
>>>> n = 0
>>>> E-z:
>>>> n = 0
>>>> E-zt:
>>>> n = 0
>>>> swapcoords = no
>>>> adress = FALSE
>>>> userint1 = 0
>>>> userint2 = 0
>>>> userint3 = 0
>>>> userint4 = 0
>>>> userreal1 = 0
>>>> userreal2 = 0
>>>> userreal3 = 0
>>>> userreal4 = 0
>>>> grpopts:
>>>> nrdf: 869226
>>>> ref-t: 300
>>>> tau-t: 0.1
>>>> annealing: No
>>>> annealing-npoints: 0
>>>> acc: 0 0 0
>>>> nfreeze: N N N
>>>> energygrp-flags[ 0]: 0
>>>> Using 1 MPI process
>>>> Using 32 OpenMP threads
>>>>
>>>> Detecting CPU SIMD instructions.
>>>> Present hardware specification:
>>>> Vendor: GenuineIntel
>>>> Brand: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>>>> Family: 6 Model: 62 Stepping: 4
>>>> Features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr
>>>> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3
>>>> sse4.1 sse4.2 ssse3 tdt x2apic
>>>> SIMD instructions most likely to fit this hardware: AVX_256
>>>> SIMD instructions selected at GROMACS compile time: AVX_256
>>>>
>>>>
>>>> 2 GPUs detected on host localhost.localdomain:
>>>> #0: NVIDIA Tesla K20c, compute cap.: 3.5, ECC: yes, stat: compatible
>>>> #1: NVIDIA GeForce GTX 650, compute cap.: 3.0, ECC: no, stat:
>>>> compatible
>>>>
>>>> 1 GPU auto-selected for this run.
>>>> Mapping of GPU to the 1 PP rank in this node: #0
>>>>
>>>>
>>>> NOTE: potentially sub-optimal launch configuration, gmx_mpi started with
>>>> less PP MPI process per node than GPUs available.
>>>> Each PP MPI process can use only one GPU, 1 GPU per node will be used.
>>>>
>>>> Will do PME sum in reciprocal space for electrostatic interactions.
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
>>>> Pedersen
>>>> A smooth particle mesh Ewald method
>>>> J. Chem. Phys. 103 (1995) pp. 8577-8592
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>> Will do ordinary reciprocal space Ewald sum.
>>>> Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
>>>> Cut-off's: NS: 1.285 Coulomb: 1.2 LJ: 1.2
>>>> System total charge: -0.012
>>>> Generated table with 1142 data points for Ewald.
>>>> Tabscale = 500 points/nm
>>>> Generated table with 1142 data points for LJ6.
>>>> Tabscale = 500 points/nm
>>>> Generated table with 1142 data points for LJ12.
>>>> Tabscale = 500 points/nm
>>>> Generated table with 1142 data points for 1-4 COUL.
>>>> Tabscale = 500 points/nm
>>>> Generated table with 1142 data points for 1-4 LJ6.
>>>> Tabscale = 500 points/nm
>>>> Generated table with 1142 data points for 1-4 LJ12.
>>>> Tabscale = 500 points/nm
>>>>
>>>> Using CUDA 8x8 non-bonded kernels
>>>>
>>>> Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -1.000e-05
>>>> Initialized non-bonded Ewald correction tables, spacing: 7.82e-04 size:
>>>> 1536
>>>>
>>>> Removing pbc first time
>>>> Pinning threads with an auto-selected logical core stride of 1
>>>>
>>>> Initializing LINear Constraint Solver
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
>>>> LINCS: A Linear Constraint Solver for molecular simulations
>>>> J. Comp. Chem. 18 (1997) pp. 1463-1472
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>> The number of constraints is 5913
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> S. Miyamoto and P. A. Kollman
>>>> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for
>>>> Rigid
>>>> Water Models
>>>> J. Comp. Chem. 13 (1992) pp. 952-962
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>> Center of mass motion removal mode is Linear
>>>> We have the following groups for center of mass motion removal:
>>>> 0: rest
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> G. Bussi, D. Donadio and M. Parrinello
>>>> Canonical sampling through velocity rescaling
>>>> J. Chem. Phys. 126 (2007) pp. 014101
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>> There are: 434658 Atoms
>>>>
>>>> Constraining the starting coordinates (step 0)
>>>>
>>>> Constraining the coordinates at t0-dt (step 0)
>>>> RMS relative constraint deviation after constraining: 3.67e-05
>>>> Initial temperature: 300.5 K
>>>>
>>>> Started mdrun on rank 0 Mon Dec 22 16:28:01 2014
>>>> Step Time Lambda
>>>> 0 0.00000 0.00000
>>>>
>>>> Energies (kJ/mol)
>>>>      G96Angle   Proper Dih.  Improper Dih.         LJ-14    Coulomb-14
>>>>   9.74139e+03   4.34956e+03    2.97359e+03  -1.93107e+02   8.05534e+04
>>>>       LJ (SR)  Coulomb (SR)   Coul. recip.     Potential   Kinetic En.
>>>>   1.01340e+06  -7.13271e+06    2.01361e+04  -6.00175e+06   1.09887e+06
>>>>  Total Energy  Conserved En.   Temperature  Pressure (bar)  Constr. rmsd
>>>>  -4.90288e+06  -4.90288e+06    3.04092e+02   1.70897e+02   2.16683e-05
>>>>
>>>> step  80: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6279.0 M-cycles
>>>> step 160: timed with pme grid 112 112 112, coulomb cutoff 1.306: 6962.2 M-cycles
>>>> step 240: timed with pme grid 100 100 100, coulomb cutoff 1.463: 8406.5 M-cycles
>>>> step 320: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6424.0 M-cycles
>>>> step 400: timed with pme grid 120 120 120, coulomb cutoff 1.219: 6369.1 M-cycles
>>>> step 480: timed with pme grid 112 112 112, coulomb cutoff 1.306: 7309.0 M-cycles
>>>> step 560: timed with pme grid 108 108 108, coulomb cutoff 1.355: 7521.2 M-cycles
>>>> step 640: timed with pme grid 104 104 104, coulomb cutoff 1.407: 8369.8 M-cycles
>>>> optimal pme grid 128 128 128, coulomb cutoff 1.200
>>>> Step Time Lambda
>>>> 2500 5.00000 0.00000
>>>>
>>>> Energies (kJ/mol)
>>>>      G96Angle   Proper Dih.  Improper Dih.         LJ-14    Coulomb-14
>>>>   9.72545e+03   4.33046e+03    2.98087e+03  -1.95794e+02   8.05967e+04
>>>>       LJ (SR)  Coulomb (SR)   Coul. recip.     Potential   Kinetic En.
>>>>   1.01293e+06  -7.13110e+06    2.01689e+04  -6.00057e+06   1.08489e+06
>>>>  Total Energy  Conserved En.   Temperature  Pressure (bar)  Constr. rmsd
>>>>  -4.91567e+06  -4.90300e+06    3.00225e+02   1.36173e+02   2.25998e-05
>>>>
>>>> Step Time Lambda
>>>> 5000 10.00000 0.00000
>>>>
>>>> ............
>>>>
>>>>
>>>> -------------------------------------------------------------------------------
>>>>
>>>>
>>>> Thank you in advance
>>>>
>>>> --
>>>> Carmen Di Giovanni, PhD
>>>> Dept. of Pharmaceutical and Toxicological Chemistry
>>>> "Drug Discovery Lab"
>>>> University of Naples "Federico II"
>>>> Via D. Montesano, 49
>>>> 80131 Naples
>>>> Tel.: ++39 081 678623
>>>> Fax: ++39 081 678100
>>>> Email: cdigiova at unina.it
>>>>
>>>>
>>>>
>>>> Quoting Justin Lemkul <jalemkul at vt.edu>:
>>>>
>>>>>
>>>>>
>>>>> On 2/18/15 11:09 AM, Barnett, James W wrote:
>>>>>>
>>>>>>
>>>>>> What's your exact command?
>>>>>>
>>>>>
>>>>> A full .log file would be even better; it would tell us everything we
>>>>> need
>>>>> to know :)
>>>>>
>>>>> -Justin
>>>>>
>>>>>> Have you reviewed this page:
>>>>>> http://www.gromacs.org/Documentation/Acceleration_and_parallelization
>>>>>>
>>>>>> James "Wes" Barnett
>>>>>> Ph.D. Candidate
>>>>>> Chemical and Biomolecular Engineering
>>>>>>
>>>>>> Tulane University
>>>>>> Boggs Center for Energy and Biotechnology, Room 341-B
>>>>>>
>>>>>> ________________________________________
>>>>>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se
>>>>>> <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Carmen
>>>>>> Di
>>>>>> Giovanni <cdigiova at unina.it>
>>>>>> Sent: Wednesday, February 18, 2015 10:06 AM
>>>>>> To: gromacs.org_gmx-users at maillist.sys.kth.se
>>>>>> Subject: Re: [gmx-users] GPU low performance
>>>>>>
>>>>>> I post the message of an MD run:
>>>>>>
>>>>>>
>>>>>> Force evaluation time GPU/CPU: 40.974 ms/24.437 ms = 1.677
>>>>>> For optimal performance this ratio should be close to 1!
>>>>>>
>>>>>>
>>>>>> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>>>>>> performance loss, consider using a shorter cut-off and a finer
>>>>>> PME
>>>>>> grid.
>>>>>>
>>>>>> How can I solve this problem?
>>>>>> Thank you in advance
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Carmen Di Giovanni, PhD
>>>>>> Dept. of Pharmaceutical and Toxicological Chemistry
>>>>>> "Drug Discovery Lab"
>>>>>> University of Naples "Federico II"
>>>>>> Via D. Montesano, 49
>>>>>> 80131 Naples
>>>>>> Tel.: ++39 081 678623
>>>>>> Fax: ++39 081 678100
>>>>>> Email: cdigiova at unina.it
>>>>>>
>>>>>>
>>>>>>
>>>>>> Quoting Justin Lemkul <jalemkul at vt.edu>:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2/18/15 10:30 AM, Carmen Di Giovanni wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>> I'm working on a machine with an NVIDIA Tesla K20.
>>>>>>>> After a minimization of a protein of 1925 atoms this is the message:
>>>>>>>>
>>>>>>>> Force evaluation time GPU/CPU: 2.923 ms/116.774 ms = 0.025
>>>>>>>> For optimal performance this ratio should be close to 1!
>>>>>>>>
>>>>>>>
>>>>>>> Minimization is a poor indicator of performance. Do a real MD run.
>>>>>>>
>>>>>>>>
>>>>>>>> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>>>>>>>> performance loss.
>>>>>>>>
>>>>>>>> Core t (s) Wall t (s) (%)
>>>>>>>> Time: 3289.010 205.891 1597.4
>>>>>>>> (steps/hour)
>>>>>>>> Performance: 8480.2
>>>>>>>> Finished mdrun on rank 0 Wed Feb 18 15:50:06 2015
>>>>>>>>
>>>>>>>>
>>>>>>>> Can I improve the performance?
>>>>>>>> So far I haven't found full information in the forum to solve this
>>>>>>>> problem.
>>>>>>>> The log file is attached.
>>>>>>>>
>>>>>>>
>>>>>>> The list does not accept attachments. If you wish to share a file,
>>>>>>> upload it to a file-sharing service and provide a URL. The full
>>>>>>> .log is quite important for understanding your hardware,
>>>>>>> optimizations, and seeing full details of the performance breakdown.
>>>>>>> But again, base your assessment on MD, not EM.
>>>>>>>
>>>>>>> -Justin
>>>>>>>
>>>>>>> --
>>>>>>> ==================================================
>>>>>>>
>>>>>>> Justin A. Lemkul, Ph.D.
>>>>>>> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>>>>>>>
>>>>>>> Department of Pharmaceutical Sciences
>>>>>>> School of Pharmacy
>>>>>>> Health Sciences Facility II, Room 629
>>>>>>> University of Maryland, Baltimore
>>>>>>> 20 Penn St.
>>>>>>> Baltimore, MD 21201
>>>>>>>
>>>>>>> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
>>>>>>> http://mackerell.umaryland.edu/~jalemkul
>>>>>>>
>>>>>>> ==================================================
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> ==================================================
>>>>>
>>>>> Justin A. Lemkul, Ph.D.
>>>>> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>>>>>
>>>>> Department of Pharmaceutical Sciences
>>>>> School of Pharmacy
>>>>> Health Sciences Facility II, Room 629
>>>>> University of Maryland, Baltimore
>>>>> 20 Penn St.
>>>>> Baltimore, MD 21201
>>>>>
>>>>> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
>>>>> http://mackerell.umaryland.edu/~jalemkul
>>>>>
>>>>> ==================================================
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>