[gmx-users] GPU low performance
Szilárd Páll
pall.szilard at gmail.com
Wed Feb 18 18:14:29 CET 2015
On Wed, Feb 18, 2015 at 5:57 PM, Carmen Di Giovanni <cdigiova at unina.it> wrote:
> Dear all, the full log file is too big.
Use pastebin or similar services.
> However, in the middle part of it there is only information about the
> energies at each step. The first part is already posted.
OK, so first of all, this looks nothing like the alarmingly low
CPU-GPU overlap you posted about initially. Here, the GPU you are
using simply can't keep up with the 2x8 Xeon E5-2650 v2 cores. You can
see this from the fraction of runtime the CPU spends waiting for the
GPU, shown in the performance table's "Wait GPU local" row: 28.7% of
the run is spent idling.
At the moment the non-bonded computation, which is done entirely on
the GPU, can't be split between CPU and GPU, so your options are
limited and most of them will have only a minor effect:
i) indirectly shift work back to the CPU and/or improve the overlap efficiency:
a) try decreasing nstlist to 10, 20, or 25;
b) run on fewer threads (as suggested before), which will likely
improve performance in some non-overlapping code parts;
c) run with DD, e.g. -ntmpi 4 -ntomp 4/8 -gpu_id 0011 or -ntmpi 8
-gpu_id 00001111 (see the command sketch after this list);
ii) reduce the "Rest" time. Not sure what's causing it, but your
simulation spends a substantial amount (15.6%) of the runtime in
unaccounted-for, likely serial, computation; i-b and i-c will likely
reduce this somewhat too;
iii) get more and/or faster GPUs.
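
A minimal sketch of what i-a, i-b, and i-c could look like on the
command line (illustrative only: keep your own -deffnm and input,
match the rank/thread counts to your hardware, and note that -ntmpi
only exists in the thread-MPI build of mdrun; with your MPI build you
would instead launch i-c as "mpirun -np 4 gmx_mpi mdrun -ntomp 8 ..."):

# i-a: smaller pair-list update interval
gmx mdrun -deffnm prod_20ns -nstlist 20

# i-b: one rank with fewer, pinned threads (no HyperThreading)
gmx mdrun -deffnm prod_20ns -ntomp 16 -pin on

# i-c: domain decomposition with 4 thread-MPI ranks; -gpu_id 0011
#      maps two ranks onto each of the two detected GPUs
gmx mdrun -deffnm prod_20ns -ntmpi 4 -ntomp 8 -gpu_id 0011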
> So I post the final part of it:
> -------------------------------------------------------------
> Step Time Lambda
> 10000000 20000.00000 0.00000
>
> Writing checkpoint, step 10000000 at Mon Dec 29 13:16:22 2014
>
>
> Energies (kJ/mol)
> G96Angle Proper Dih. Improper Dih. LJ-14 Coulomb-14
> 9.34206e+03 4.14342e+03 2.79172e+03 -1.75465e+02 7.99811e+04
> LJ (SR) Coulomb (SR) Coul. recip. Potential Kinetic En.
> 1.01135e+06 -7.13064e+06 2.01349e+04 -6.00306e+06 1.08201e+06
> Total Energy Conserved En. Temperature Pressure (bar) Constr. rmsd
> -4.92106e+06 -5.86747e+06 2.99426e+02 1.29480e+02 2.16280e-05
>
> <====== ############### ==>
> <==== A V E R A G E S ====>
> <== ############### ======>
>
> Statistics over 10000001 steps using 10000001 frames
>
> Energies (kJ/mol)
> G96Angle Proper Dih. Improper Dih. LJ-14 Coulomb-14
> 9.45818e+03 4.30665e+03 2.92407e+03 -1.75556e+02 8.02473e+04
> LJ (SR) Coulomb (SR) Coul. recip. Potential Kinetic En.
> 1.01284e+06 -7.13138e+06 2.01510e+04 -6.00163e+06 1.08407e+06
> Total Energy Conserved En. Temperature Pressure (bar) Constr. rmsd
> -4.91756e+06 -5.38519e+06 2.99998e+02 1.37549e+02 0.00000e+00
>
> Total Virial (kJ/mol)
> 3.42887e+05 1.63625e+01 1.23658e+02
> 1.67406e+01 3.42916e+05 -4.27834e+01
> 1.23997e+02 -4.29636e+01 3.42881e+05
>
> Pressure (bar)
> 1.37573e+02 7.50214e-02 -1.03916e-01
> 7.22048e-02 1.37623e+02 -1.66417e-02
> -1.06444e-01 -1.52990e-02 1.37453e+02
>
>
> M E G A - F L O P S A C C O U N T I N G
>
> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
> W3=SPC/TIP3p W4=TIP4p (single or pairs)
> V&F=Potential and force V=Potential only F=Force only
>
> Computing: M-Number M-Flops % Flops
> -----------------------------------------------------------------------------
> Pair Search distance check 16343508.605344 147091577.448 0.0
> NxN Ewald Elec. + LJ [V&F] 5072118956.506304 542716728346.174 98.1
> 1,4 nonbonded interactions 95860.009586 8627400.863 0.0
> Calc Weights 13039741.303974 469430686.943 0.1
> Spread Q Bspline 278181147.818112 556362295.636 0.1
> Gather F Bspline 278181147.818112 1669086886.909 0.3
> 3D-FFT 880787450.909824 7046299607.279 1.3
> Solve PME 163837.909504 10485626.208 0.0
> Shift-X 108664.934658 651989.608 0.0
> Angles 86090.008609 14463121.446 0.0
> Propers 31380.003138 7186020.719 0.0
> Impropers 28790.002879 5988320.599 0.0
> Virial 4347030.434703 78246547.825 0.0
> Stop-CM 4346580.869316 43465808.693 0.0
> Calc-Ekin 4346580.869316 117357683.472 0.0
> Lincs 59130.017739 3547801.064 0.0
> Lincs-Mat 1033080.309924 4132321.240 0.0
> Constraint-V 4406580.881316 35252647.051 0.0
> Constraint-Vir 4347450.434745 104338810.434 0.0
> Settle 1429440.428832 461709258.513 0.1
> -----------------------------------------------------------------------------
> Total 553500452758.122 100.0
> -----------------------------------------------------------------------------
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 1 MPI rank, each using 32 OpenMP threads
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
> -----------------------------------------------------------------------------
> Neighbor search 1 32 250001 6231.657 518475.694 1.1
> Launch GPU ops. 1 32 10000001 1825.689 151897.833 0.3
> Force 1 32 10000001 49568.959 4124152.027 8.4
> PME mesh 1 32 10000001 194798.850 16207321.863 32.8
> Wait GPU local 1 32 10000001 170272.438 14166717.115 28.7
> NB X/F buffer ops. 1 32 19750001 29175.632 2427421.177 4.9
> Write traj. 1 32 20635 1567.928 130452.056 0.3
> Update 1 32 10000001 13312.819 1107630.452 2.2
> Constraints 1 32 10000001 34210.142 2846293.908 5.8
> Rest 92338.781 7682613.897 15.6
> -----------------------------------------------------------------------------
> Total 593302.894 49362976.023 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME spread/gather 1 32 20000002 144767.207 12044674.424 24.4
> PME 3D-FFT 1 32 20000002 39499.157 3286341.501 6.7
> PME solve Elec 1 32 10000001 9947.340 827621.589 1.7
> -----------------------------------------------------------------------------
>
> GPU timings
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
> -----------------------------------------------------------------------------
> Pair list H2D 250001 935.751 3.743 0.2
> X / q H2D 10000001 11509.209 1.151 2.8
> Nonbonded F+ene k. 9750000 377111.949 38.678 92.0
> Nonbonded F+ene+prune k. 250001 12049.010 48.196 2.9
> F D2H 10000001 8129.292 0.813 2.0
> -----------------------------------------------------------------------------
> Total 409735.211 40.974 100.0
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 40.974 ms/24.437 ms = 1.677
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
> performance loss, consider using a shorter cut-off and a finer PME
> grid.
>
> Core t (s) Wall t (s) (%)
> Time: 18713831.228 593302.894 3154.2
> 6d20h48:22
> (ns/day) (hour/ns)
> Performance: 2.913 8.240
> Finished mdrun on rank 0 Mon Dec 29 13:16:24 2014
>
>
> -------------------------------------------------------
> thank you in advance
> Carmen
>
>
>
> --
> Carmen Di Giovanni, PhD
> Dept. of Pharmaceutical and Toxicological Chemistry
> "Drug Discovery Lab"
> University of Naples "Federico II"
> Via D. Montesano, 49
> 80131 Naples
> Tel.: ++39 081 678623
> Fax: ++39 081 678100
> Email: cdigiova at unina.it
>
>
>
> Quoting Szilárd Páll <pall.szilard at gmail.com>:
>
>> We need a *full* log file, not parts of it!
>>
>> You can try running with "-ntomp 16 -pin on" - it may be a bit faster
>> to not use HyperThreading.
>> --
>> Szilárd
>>
>>
>> On Wed, Feb 18, 2015 at 5:20 PM, Carmen Di Giovanni <cdigiova at unina.it>
>> wrote:
>>>
>>> Justin,
>>> the problem is evident for all calculations.
>>> This is the log file of a recent run:
>>>
>>>
>>> --------------------------------------------------------------------------------
>>>
>>> Log file opened on Mon Dec 22 16:28:00 2014
>>> Host: localhost.localdomain pid: 8378 rank ID: 0 number of ranks: 1
>>> GROMACS: gmx mdrun, VERSION 5.0
>>>
>>> GROMACS is written by:
>>> Emile Apol Rossen Apostolov Herman J.C. Berendsen Par Bjelkmar
>>> Aldert van Buuren Rudi van Drunen Anton Feenstra Sebastian
>>> Fritsch
>>> Gerrit Groenhof Christoph Junghans Peter Kasson Carsten Kutzner
>>> Per Larsson Justin A. Lemkul Magnus Lundborg Pieter
>>> Meulenhoff
>>> Erik Marklund Teemu Murtola Szilard Pall Sander Pronk
>>> Roland Schulz Alexey Shvetsov Michael Shirts Alfons Sijbers
>>> Peter Tieleman Christian Wennberg Maarten Wolf
>>> and the project leaders:
>>> Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel
>>>
>>> Copyright (c) 1991-2000, University of Groningen, The Netherlands.
>>> Copyright (c) 2001-2014, The GROMACS development team at
>>> Uppsala University, Stockholm University and
>>> the Royal Institute of Technology, Sweden.
>>> check out http://www.gromacs.org for more information.
>>>
>>> GROMACS is free software; you can redistribute it and/or modify it
>>> under the terms of the GNU Lesser General Public License
>>> as published by the Free Software Foundation; either version 2.1
>>> of the License, or (at your option) any later version.
>>>
>>> GROMACS: gmx mdrun, VERSION 5.0
>>> Executable: /opt/SW/gromacs-5.0/build/mpi-cuda/bin/gmx_mpi
>>> Library dir: /opt/SW/gromacs-5.0/share/top
>>> Command line:
>>> gmx_mpi mdrun -deffnm prod_20ns
>>>
>>> Gromacs version: VERSION 5.0
>>> Precision: single
>>> Memory model: 64 bit
>>> MPI library: MPI
>>> OpenMP support: enabled
>>> GPU support: enabled
>>> invsqrt routine: gmx_software_invsqrt(x)
>>> SIMD instructions: AVX_256
>>> FFT library: fftw-3.3.3-sse2
>>> RDTSCP usage: enabled
>>> C++11 compilation: disabled
>>> TNG support: enabled
>>> Tracing support: disabled
>>> Built on: Thu Jul 31 18:30:37 CEST 2014
>>> Built by: root at localhost.localdomain [CMAKE]
>>> Build OS/arch: Linux 2.6.32-431.el6.x86_64 x86_64
>>> Build CPU vendor: GenuineIntel
>>> Build CPU brand: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>>> Build CPU family: 6 Model: 62 Stepping: 4
>>> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx
>>> msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2
>>> sse3
>>> sse4.1 sse4.2 ssse3 tdt x2apic
>>> C compiler: /usr/bin/cc GNU 4.4.7
>>> C compiler flags: -mavx -Wno-maybe-uninitialized -Wextra
>>> -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
>>> -Wno-unused -Wunused-value -Wunused-parameter -fomit-frame-pointer
>>> -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
>>> C++ compiler: /usr/bin/c++ GNU 4.4.7
>>> C++ compiler flags: -mavx -Wextra -Wno-missing-field-initializers
>>> -Wpointer-arith -Wall -Wno-unused-function -fomit-frame-pointer
>>> -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
>>> Boost version: 1.55.0 (internal)
>>> CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda
>>> compiler
>>> driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on
>>> Thu_Mar_13_11:58:58_PDT_2014;Cuda compilation tools, release 6.0, V6.0.1
>>> CUDA compiler
>>>
>>> flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;-Xcompiler;-fPIC
>>> ;
>>>
>>> ;-mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-fomit-frame-pointer;-funroll-all-loops;-Wno-array-bounds;-O3;-DNDEBUG
>>> CUDA driver: 6.50
>>> CUDA runtime: 6.0
>>>
>>>
>>>
>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>> B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
>>> GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
>>> molecular simulation
>>> J. Chem. Theory Comput. 4 (2008) pp. 435-447
>>> -------- -------- --- Thank You --- -------- --------
>>>
>>>
>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>> D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J.
>>> C.
>>> Berendsen
>>> GROMACS: Fast, Flexible and Free
>>> J. Comp. Chem. 26 (2005) pp. 1701-1719
>>> -------- -------- --- Thank You --- -------- --------
>>>
>>>
>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>> E. Lindahl and B. Hess and D. van der Spoel
>>> GROMACS 3.0: A package for molecular simulation and trajectory analysis
>>> J. Mol. Mod. 7 (2001) pp. 306-317
>>> -------- -------- --- Thank You --- -------- --------
>>>
>>>
>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>> H. J. C. Berendsen, D. van der Spoel and R. van Drunen
>>> GROMACS: A message-passing parallel molecular dynamics implementation
>>> Comp. Phys. Comm. 91 (1995) pp. 43-56
>>> -------- -------- --- Thank You --- -------- --------
>>>
>>>
>>> For optimal performance with a GPU nstlist (now 10) should be larger.
>>> The optimum depends on your CPU and GPU resources.
>>> You might want to try several nstlist values.
>>> Changing nstlist from 10 to 40, rlist from 1.2 to 1.285
>>>
>>> Input Parameters:
>>> integrator = md
>>> tinit = 0
>>> dt = 0.002
>>> nsteps = 10000000
>>> init-step = 0
>>> simulation-part = 1
>>> comm-mode = Linear
>>> nstcomm = 1
>>> bd-fric = 0
>>> ld-seed = 1993
>>> emtol = 10
>>> emstep = 0.01
>>> niter = 20
>>> fcstep = 0
>>> nstcgsteep = 1000
>>> nbfgscorr = 10
>>> rtpi = 0.05
>>> nstxout = 2500
>>> nstvout = 2500
>>> nstfout = 0
>>> nstlog = 2500
>>> nstcalcenergy = 1
>>> nstenergy = 2500
>>> nstxout-compressed = 500
>>> compressed-x-precision = 1000
>>> cutoff-scheme = Verlet
>>> nstlist = 40
>>> ns-type = Grid
>>> pbc = xyz
>>> periodic-molecules = FALSE
>>> verlet-buffer-tolerance = 0.005
>>> rlist = 1.285
>>> rlistlong = 1.285
>>> nstcalclr = 10
>>> coulombtype = PME
>>> coulomb-modifier = Potential-shift
>>> rcoulomb-switch = 0
>>> rcoulomb = 1.2
>>> epsilon-r = 1
>>> epsilon-rf = 1
>>> vdw-type = Cut-off
>>> vdw-modifier = Potential-shift
>>> rvdw-switch = 0
>>> rvdw = 1.2
>>> DispCorr = No
>>> table-extension = 1
>>> fourierspacing = 0.135
>>> fourier-nx = 128
>>> fourier-ny = 128
>>> fourier-nz = 128
>>> pme-order = 4
>>> ewald-rtol = 1e-05
>>> ewald-rtol-lj = 0.001
>>> lj-pme-comb-rule = Geometric
>>> ewald-geometry = 0
>>> epsilon-surface = 0
>>> implicit-solvent = No
>>> gb-algorithm = Still
>>> nstgbradii = 1
>>> rgbradii = 2
>>> gb-epsilon-solvent = 80
>>> gb-saltconc = 0
>>> gb-obc-alpha = 1
>>> gb-obc-beta = 0.8
>>> gb-obc-gamma = 4.85
>>> gb-dielectric-offset = 0.009
>>> sa-algorithm = Ace-approximation
>>> sa-surface-tension = 2.092
>>> tcoupl = V-rescale
>>> nsttcouple = 10
>>> nh-chain-length = 0
>>> print-nose-hoover-chain-variables = FALSE
>>> pcoupl = No
>>> pcoupltype = Semiisotropic
>>> nstpcouple = -1
>>> tau-p = 0.5
>>> compressibility (3x3):
>>> compressibility[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>> compressibility[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>> compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>> ref-p (3x3):
>>> ref-p[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>> ref-p[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>> ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>> refcoord-scaling = No
>>> posres-com (3):
>>> posres-com[0]= 0.00000e+00
>>> posres-com[1]= 0.00000e+00
>>> posres-com[2]= 0.00000e+00
>>> posres-comB (3):
>>> posres-comB[0]= 0.00000e+00
>>> posres-comB[1]= 0.00000e+00
>>> posres-comB[2]= 0.00000e+00
>>> QMMM = FALSE
>>> QMconstraints = 0
>>> QMMMscheme = 0
>>> MMChargeScaleFactor = 1
>>> qm-opts:
>>> ngQM = 0
>>> constraint-algorithm = Lincs
>>> continuation = FALSE
>>> Shake-SOR = FALSE
>>> shake-tol = 0.0001
>>> lincs-order = 4
>>> lincs-iter = 1
>>> lincs-warnangle = 30
>>> nwall = 0
>>> wall-type = 9-3
>>> wall-r-linpot = -1
>>> wall-atomtype[0] = -1
>>> wall-atomtype[1] = -1
>>> wall-density[0] = 0
>>> wall-density[1] = 0
>>> wall-ewald-zfac = 3
>>> pull = no
>>> rotation = FALSE
>>> interactiveMD = FALSE
>>> disre = No
>>> disre-weighting = Conservative
>>> disre-mixed = FALSE
>>> dr-fc = 1000
>>> dr-tau = 0
>>> nstdisreout = 100
>>> orire-fc = 0
>>> orire-tau = 0
>>> nstorireout = 100
>>> free-energy = no
>>> cos-acceleration = 0
>>> deform (3x3):
>>> deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>> deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>> deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>> simulated-tempering = FALSE
>>> E-x:
>>> n = 0
>>> E-xt:
>>> n = 0
>>> E-y:
>>> n = 0
>>> E-yt:
>>> n = 0
>>> E-z:
>>> n = 0
>>> E-zt:
>>> n = 0
>>> swapcoords = no
>>> adress = FALSE
>>> userint1 = 0
>>> userint2 = 0
>>> userint3 = 0
>>> userint4 = 0
>>> userreal1 = 0
>>> userreal2 = 0
>>> userreal3 = 0
>>> userreal4 = 0
>>> grpopts:
>>> nrdf: 869226
>>> ref-t: 300
>>> tau-t: 0.1
>>> annealing: No
>>> annealing-npoints: 0
>>> acc: 0 0 0
>>> nfreeze: N N N
>>> energygrp-flags[ 0]: 0
>>> Using 1 MPI process
>>> Using 32 OpenMP threads
>>>
>>> Detecting CPU SIMD instructions.
>>> Present hardware specification:
>>> Vendor: GenuineIntel
>>> Brand: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>>> Family: 6 Model: 62 Stepping: 4
>>> Features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr
>>> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3
>>> sse4.1 sse4.2 ssse3 tdt x2apic
>>> SIMD instructions most likely to fit this hardware: AVX_256
>>> SIMD instructions selected at GROMACS compile time: AVX_256
>>>
>>>
>>> 2 GPUs detected on host localhost.localdomain:
>>> #0: NVIDIA Tesla K20c, compute cap.: 3.5, ECC: yes, stat: compatible
>>> #1: NVIDIA GeForce GTX 650, compute cap.: 3.0, ECC: no, stat:
>>> compatible
>>>
>>> 1 GPU auto-selected for this run.
>>> Mapping of GPU to the 1 PP rank in this node: #0
>>>
>>>
>>> NOTE: potentially sub-optimal launch configuration, gmx_mpi started with
>>> less
>>> PP MPI process per node than GPUs available.
>>> Each PP MPI process can use only one GPU, 1 GPU per node will be
>>> used.
>>>
>>> Will do PME sum in reciprocal space for electrostatic interactions.
>>>
>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
>>> Pedersen
>>> A smooth particle mesh Ewald method
>>> J. Chem. Phys. 103 (1995) pp. 8577-8592
>>> -------- -------- --- Thank You --- -------- --------
>>>
>>> Will do ordinary reciprocal space Ewald sum.
>>> Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
>>> Cut-off's: NS: 1.285 Coulomb: 1.2 LJ: 1.2
>>> System total charge: -0.012
>>> Generated table with 1142 data points for Ewald.
>>> Tabscale = 500 points/nm
>>> Generated table with 1142 data points for LJ6.
>>> Tabscale = 500 points/nm
>>> Generated table with 1142 data points for LJ12.
>>> Tabscale = 500 points/nm
>>> Generated table with 1142 data points for 1-4 COUL.
>>> Tabscale = 500 points/nm
>>> Generated table with 1142 data points for 1-4 LJ6.
>>> Tabscale = 500 points/nm
>>> Generated table with 1142 data points for 1-4 LJ12.
>>> Tabscale = 500 points/nm
>>>
>>> Using CUDA 8x8 non-bonded kernels
>>>
>>> Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -1.000e-05
>>> Initialized non-bonded Ewald correction tables, spacing: 7.82e-04 size:
>>> 1536
>>>
>>> Removing pbc first time
>>> Pinning threads with an auto-selected logical core stride of 1
>>>
>>> Initializing LINear Constraint Solver
>>>
>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>> B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
>>> LINCS: A Linear Constraint Solver for molecular simulations
>>> J. Comp. Chem. 18 (1997) pp. 1463-1472
>>> -------- -------- --- Thank You --- -------- --------
>>>
>>> The number of constraints is 5913
>>>
>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>> S. Miyamoto and P. A. Kollman
>>> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for
>>> Rigid
>>> Water Models
>>> J. Comp. Chem. 13 (1992) pp. 952-962
>>> -------- -------- --- Thank You --- -------- --------
>>>
>>> Center of mass motion removal mode is Linear
>>> We have the following groups for center of mass motion removal:
>>> 0: rest
>>>
>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>> G. Bussi, D. Donadio and M. Parrinello
>>> Canonical sampling through velocity rescaling
>>> J. Chem. Phys. 126 (2007) pp. 014101
>>> -------- -------- --- Thank You --- -------- --------
>>>
>>> There are: 434658 Atoms
>>>
>>> Constraining the starting coordinates (step 0)
>>>
>>> Constraining the coordinates at t0-dt (step 0)
>>> RMS relative constraint deviation after constraining: 3.67e-05
>>> Initial temperature: 300.5 K
>>>
>>> Started mdrun on rank 0 Mon Dec 22 16:28:01 2014
>>> Step Time Lambda
>>> 0 0.00000 0.00000
>>>
>>> Energies (kJ/mol)
>>> G96Angle Proper Dih. Improper Dih. LJ-14
>>> Coulomb-14
>>> 9.74139e+03 4.34956e+03 2.97359e+03 -1.93107e+02
>>> 8.05534e+04
>>> LJ (SR) Coulomb (SR) Coul. recip. Potential Kinetic
>>> En.
>>> 1.01340e+06 -7.13271e+06 2.01361e+04 -6.00175e+06
>>> 1.09887e+06
>>> Total Energy Conserved En. Temperature Pressure (bar) Constr.
>>> rmsd
>>> -4.90288e+06 -4.90288e+06 3.04092e+02 1.70897e+02
>>> 2.16683e-05
>>>
>>> step 80: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6279.0
>>> M-cycles
>>> step 160: timed with pme grid 112 112 112, coulomb cutoff 1.306: 6962.2
>>> M-cycles
>>> step 240: timed with pme grid 100 100 100, coulomb cutoff 1.463: 8406.5
>>> M-cycles
>>> step 320: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6424.0
>>> M-cycles
>>> step 400: timed with pme grid 120 120 120, coulomb cutoff 1.219: 6369.1
>>> M-cycles
>>> step 480: timed with pme grid 112 112 112, coulomb cutoff 1.306: 7309.0
>>> M-cycles
>>> step 560: timed with pme grid 108 108 108, coulomb cutoff 1.355: 7521.2
>>> M-cycles
>>> step 640: timed with pme grid 104 104 104, coulomb cutoff 1.407: 8369.8
>>> M-cycles
>>> optimal pme grid 128 128 128, coulomb cutoff 1.200
>>> Step Time Lambda
>>> 2500 5.00000 0.00000
>>>
>>> Energies (kJ/mol)
>>> G96Angle Proper Dih. Improper Dih. LJ-14
>>> Coulomb-14
>>> 9.72545e+03 4.33046e+03 2.98087e+03 -1.95794e+02
>>> 8.05967e+04
>>> LJ (SR) Coulomb (SR) Coul. recip. Potential Kinetic
>>> En.
>>> 1.01293e+06 -7.13110e+06 2.01689e+04 -6.00057e+06
>>> 1.08489e+06
>>> Total Energy Conserved En. Temperature Pressure (bar) Constr.
>>> rmsd
>>> -4.91567e+06 -4.90300e+06 3.00225e+02 1.36173e+02
>>> 2.25998e-05
>>>
>>> Step Time Lambda
>>> 5000 10.00000 0.00000
>>>
>>> ............
>>>
>>>
>>> -------------------------------------------------------------------------------
>>>
>>>
>>> Thank you in advance
>>>
>>> --
>>> Carmen Di Giovanni, PhD
>>> Dept. of Pharmaceutical and Toxicological Chemistry
>>> "Drug Discovery Lab"
>>> University of Naples "Federico II"
>>> Via D. Montesano, 49
>>> 80131 Naples
>>> Tel.: ++39 081 678623
>>> Fax: ++39 081 678100
>>> Email: cdigiova at unina.it
>>>
>>>
>>>
>>> Quoting Justin Lemkul <jalemkul at vt.edu>:
>>>
>>>>
>>>>
>>>> On 2/18/15 11:09 AM, Barnett, James W wrote:
>>>>>
>>>>>
>>>>> What's your exact command?
>>>>>
>>>>
>>>> A full .log file would be even better; it would tell us everything we
>>>> need
>>>> to know :)
>>>>
>>>> -Justin
>>>>
>>>>> Have you reviewed this page:
>>>>> http://www.gromacs.org/Documentation/Acceleration_and_parallelization
>>>>>
>>>>> James "Wes" Barnett
>>>>> Ph.D. Candidate
>>>>> Chemical and Biomolecular Engineering
>>>>>
>>>>> Tulane University
>>>>> Boggs Center for Energy and Biotechnology, Room 341-B
>>>>>
>>>>> ________________________________________
>>>>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se
>>>>> <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Carmen
>>>>> Di
>>>>> Giovanni <cdigiova at unina.it>
>>>>> Sent: Wednesday, February 18, 2015 10:06 AM
>>>>> To: gromacs.org_gmx-users at maillist.sys.kth.se
>>>>> Subject: Re: [gmx-users] GPU low performance
>>>>>
>>>>> I post the message of a md run :
>>>>>
>>>>>
>>>>> Force evaluation time GPU/CPU: 40.974 ms/24.437 ms = 1.677
>>>>> For optimal performance this ratio should be close to 1!
>>>>>
>>>>>
>>>>> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>>>>> performance loss, consider using a shorter cut-off and a finer
>>>>> PME
>>>>> grid.
>>>>>
>>>>> How can I solve this problem?
>>>>> Thank you in advance
>>>>>
>>>>>
>>>>> --
>>>>> Carmen Di Giovanni, PhD
>>>>> Dept. of Pharmaceutical and Toxicological Chemistry
>>>>> "Drug Discovery Lab"
>>>>> University of Naples "Federico II"
>>>>> Via D. Montesano, 49
>>>>> 80131 Naples
>>>>> Tel.: ++39 081 678623
>>>>> Fax: ++39 081 678100
>>>>> Email: cdigiova at unina.it
>>>>>
>>>>>
>>>>>
>>>>> Quoting Justin Lemkul <jalemkul at vt.edu>:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 2/18/15 10:30 AM, Carmen Di Giovanni wrote:
>>>>>>>
>>>>>>>
>>>>>>> Dear all,
>>>>>>> I'm working on a machine with an NVIDIA Tesla K20.
>>>>>>> After a minimization on a protein of 1925 atoms this is the message:
>>>>>>>
>>>>>>> Force evaluation time GPU/CPU: 2.923 ms/116.774 ms = 0.025
>>>>>>> For optimal performance this ratio should be close to 1!
>>>>>>>
>>>>>>
>>>>>> Minimization is a poor indicator of performance. Do a real MD run.
>>>>>>
>>>>>>>
>>>>>>> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>>>>>>> performance loss.
>>>>>>>
>>>>>>> Core t (s) Wall t (s) (%)
>>>>>>> Time: 3289.010 205.891 1597.4
>>>>>>> (steps/hour)
>>>>>>> Performance: 8480.2
>>>>>>> Finished mdrun on rank 0 Wed Feb 18 15:50:06 2015
>>>>>>>
>>>>>>>
>>>>>>> Can I improve the performance?
>>>>>>> At the moment I haven't found enough information in the forum to solve
>>>>>>> this problem.
>>>>>>> The .log file is attached.
>>>>>>>
>>>>>>
>>>>>> The list does not accept attachments. If you wish to share a file,
>>>>>> upload it to a file-sharing service and provide a URL. The full
>>>>>> .log is quite important for understanding your hardware,
>>>>>> optimizations, and seeing full details of the performance breakdown.
>>>>>> But again, base your assessment on MD, not EM.
>>>>>>
>>>>>> -Justin
>>>>>>
>>>>>> --
>>>>>> ==================================================
>>>>>>
>>>>>> Justin A. Lemkul, Ph.D.
>>>>>> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>>>>>>
>>>>>> Department of Pharmaceutical Sciences
>>>>>> School of Pharmacy
>>>>>> Health Sciences Facility II, Room 629
>>>>>> University of Maryland, Baltimore
>>>>>> 20 Penn St.
>>>>>> Baltimore, MD 21201
>>>>>>
>>>>>> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
>>>>>> http://mackerell.umaryland.edu/~jalemkul
>>>>>>
>>>>>> ==================================================
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> ==================================================
>>>>
>>>> Justin A. Lemkul, Ph.D.
>>>> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>>>>
>>>> Department of Pharmaceutical Sciences
>>>> School of Pharmacy
>>>> Health Sciences Facility II, Room 629
>>>> University of Maryland, Baltimore
>>>> 20 Penn St.
>>>> Baltimore, MD 21201
>>>>
>>>> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
>>>> http://mackerell.umaryland.edu/~jalemkul
>>>>
>>>> ==================================================
>>>
>>>
>>>
>>