[gmx-users] GPU low performance
Szilárd Páll
pall.szilard at gmail.com
Wed Feb 18 18:18:09 CET 2015
I've just noticed something serious. Why are you calculating energies
every step? Doing that makes the non-bonded force calculation on
average 25-30% slower than e.g. calculating energies only every 100th
step.
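
That's the nstcalcenergy setting in the .mdp; the posted log shows
nstcalcenergy = 1. A minimal sketch of the change, assuming the rest of
the input stays as posted:

    nstcalcenergy  = 100    ; compute energies every 100 steps instead of every step
    nstenergy      = 2500   ; unchanged, still a multiple of nstcalcenergy

Re-running grompp to regenerate the .tpr is needed for this to take effect.
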
You may be able to get another 5% or so from your GPU; could you post
the output of "nvidia-smi -q -g 0"?
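
The parts of that output usually worth checking are the Performance
State, the current Clocks versus the Applications/Max Clocks, and the
ECC mode. As a rough sketch of how those are inspected and, where the
board and driver allow it, adjusted (the clock values are placeholders
to take from the "Supported Clocks" section; both changes need root,
and the ECC change only applies after a reboot):

    nvidia-smi -q -g 0                       # full query, as above
    nvidia-smi -i 0 -ac <memMHz>,<gfxMHz>    # raise application clocks, if the board exposes them as settable
    nvidia-smi -i 0 -e 0                     # disable ECC; can buy a few percent of throughput

Treat the -ac line as optional; not every Kepler board allows changing
application clocks.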
--
Szilárd
On Wed, Feb 18, 2015 at 6:14 PM, Szilárd Páll <pall.szilard at gmail.com> wrote:
> On Wed, Feb 18, 2015 at 5:57 PM, Carmen Di Giovanni <cdigiova at unina.it> wrote:
>> Dear all, the full log file is too big.
>
> Use pastebin or similar services.
>
>> However, the middle part of it contains only information about the
>> energies at each step. The first part was already posted.
>
> OK, so first of all, this looks nothing like the alarmingly low
> CPU-GPU overlap you posted about initially. Here, the GPU you are
> using simply can't keep up with 2x8 Ivy Bridge-EP cores (E5-2650 v2).
> You can see this in the fraction of the runtime the CPU spends waiting
> for the GPU, shown in the performance table's "Wait GPU local" row,
> which accounts for 28.7% of the run spent idling.
>
> At the moment the non-bonded computation, which is done entirely on the
> GPU, can't be split between CPU and GPU, so your options are limited
> and most of them will have only a minor effect:
> i) indirectly shift work back to the CPU and/or improve the overlap efficiency (example commands below):
> a) try decreasing nstlist to 10-25
> b) run on fewer threads (as suggested before), which will likely
> improve performance in some non-overlapping code parts
> c) run with DD, e.g. -ntmpi 4 -ntomp 4/8 -gpu_id 0011 or -ntmpi 8
> -gpu_id 00001111
>
> ii) Reduce the "Rest" time. Not sure what's causing it, but your
> simulation spends a substantial fraction (15.6%) of the runtime in
> unaccounted-for, likely serial, computation; i-b and i-c will likely
> reduce this somewhat too;
>
> iii) get more and/or faster GPUs.
>
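> As a concrete sketch of i-a/b/c for this setup (the build uses real
> MPI, so DD ranks come from mpirun rather than -ntmpi; the rank/thread
> counts and GPU id strings below are only examples):
>
>   # i-a, tried at runtime without regenerating the tpr:
>   gmx_mpi mdrun -deffnm prod_20ns -nstlist 20
>
>   # i-b: one rank, 16 threads pinned to the 16 physical cores:
>   gmx_mpi mdrun -deffnm prod_20ns -ntomp 16 -pin on
>
>   # i-c: 4 PP ranks with DD, 8 threads each, two ranks per GPU
>   # (GPU 0 = K20c, GPU 1 = GTX 650):
>   mpirun -np 4 gmx_mpi mdrun -deffnm prod_20ns -ntomp 8 -gpu_id 0011 -pin on
>
>   # or keep all ranks on the K20c if the GTX 650 slows things down:
>   mpirun -np 4 gmx_mpi mdrun -deffnm prod_20ns -ntomp 8 -gpu_id 0000 -pin on
>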
>> So I post the final part of it:
>> -------------------------------------------------------------
>> Step Time Lambda
>> 10000000 20000.00000 0.00000
>>
>> Writing checkpoint, step 10000000 at Mon Dec 29 13:16:22 2014
>>
>>
>> Energies (kJ/mol)
>> G96Angle Proper Dih. Improper Dih. LJ-14 Coulomb-14
>> 9.34206e+03 4.14342e+03 2.79172e+03 -1.75465e+02 7.99811e+04
>> LJ (SR) Coulomb (SR) Coul. recip. Potential Kinetic En.
>> 1.01135e+06 -7.13064e+06 2.01349e+04 -6.00306e+06 1.08201e+06
>> Total Energy Conserved En. Temperature Pressure (bar) Constr. rmsd
>> -4.92106e+06 -5.86747e+06 2.99426e+02 1.29480e+02 2.16280e-05
>>
>> <====== ############### ==>
>> <==== A V E R A G E S ====>
>> <== ############### ======>
>>
>> Statistics over 10000001 steps using 10000001 frames
>>
>> Energies (kJ/mol)
>> G96Angle Proper Dih. Improper Dih. LJ-14 Coulomb-14
>> 9.45818e+03 4.30665e+03 2.92407e+03 -1.75556e+02 8.02473e+04
>> LJ (SR) Coulomb (SR) Coul. recip. Potential Kinetic En.
>> 1.01284e+06 -7.13138e+06 2.01510e+04 -6.00163e+06 1.08407e+06
>> Total Energy Conserved En. Temperature Pressure (bar) Constr. rmsd
>> -4.91756e+06 -5.38519e+06 2.99998e+02 1.37549e+02 0.00000e+00
>>
>> Total Virial (kJ/mol)
>> 3.42887e+05 1.63625e+01 1.23658e+02
>> 1.67406e+01 3.42916e+05 -4.27834e+01
>> 1.23997e+02 -4.29636e+01 3.42881e+05
>>
>> Pressure (bar)
>> 1.37573e+02 7.50214e-02 -1.03916e-01
>> 7.22048e-02 1.37623e+02 -1.66417e-02
>> -1.06444e-01 -1.52990e-02 1.37453e+02
>>
>>
>> M E G A - F L O P S A C C O U N T I N G
>>
>> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
>> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
>> W3=SPC/TIP3p W4=TIP4p (single or pairs)
>> V&F=Potential and force V=Potential only F=Force only
>>
>> Computing: M-Number M-Flops % Flops
>> -----------------------------------------------------------------------------
>> Pair Search distance check 16343508.605344 147091577.448 0.0
>> NxN Ewald Elec. + LJ [V&F] 5072118956.506304 542716728346.174 98.1
>> 1,4 nonbonded interactions 95860.009586 8627400.863 0.0
>> Calc Weights 13039741.303974 469430686.943 0.1
>> Spread Q Bspline 278181147.818112 556362295.636 0.1
>> Gather F Bspline 278181147.818112 1669086886.909 0.3
>> 3D-FFT 880787450.909824 7046299607.279 1.3
>> Solve PME 163837.909504 10485626.208 0.0
>> Shift-X 108664.934658 651989.608 0.0
>> Angles 86090.008609 14463121.446 0.0
>> Propers 31380.003138 7186020.719 0.0
>> Impropers 28790.002879 5988320.599 0.0
>> Virial 4347030.434703 78246547.825 0.0
>> Stop-CM 4346580.869316 43465808.693 0.0
>> Calc-Ekin 4346580.869316 117357683.472 0.0
>> Lincs 59130.017739 3547801.064 0.0
>> Lincs-Mat 1033080.309924 4132321.240 0.0
>> Constraint-V 4406580.881316 35252647.051 0.0
>> Constraint-Vir 4347450.434745 104338810.434 0.0
>> Settle 1429440.428832 461709258.513 0.1
>> -----------------------------------------------------------------------------
>> Total 553500452758.122 100.0
>> -----------------------------------------------------------------------------
>>
>>
>> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>>
>> On 1 MPI rank, each using 32 OpenMP threads
>>
>> Computing: Num Num Call Wall time Giga-Cycles
>> Ranks Threads Count (s) total sum %
>> -----------------------------------------------------------------------------
>> Neighbor search 1 32 250001 6231.657 518475.694 1.1
>> Launch GPU ops. 1 32 10000001 1825.689 151897.833 0.3
>> Force 1 32 10000001 49568.959 4124152.027 8.4
>> PME mesh 1 32 10000001 194798.850 16207321.863 32.8
>> Wait GPU local 1 32 10000001 170272.438 14166717.115 28.7
>> NB X/F buffer ops. 1 32 19750001 29175.632 2427421.177 4.9
>> Write traj. 1 32 20635 1567.928 130452.056 0.3
>> Update 1 32 10000001 13312.819 1107630.452 2.2
>> Constraints 1 32 10000001 34210.142 2846293.908 5.8
>> Rest 92338.781 7682613.897 15.6
>> -----------------------------------------------------------------------------
>> Total 593302.894 49362976.023 100.0
>> -----------------------------------------------------------------------------
>> Breakdown of PME mesh computation
>> -----------------------------------------------------------------------------
>> PME spread/gather 1 32 20000002 144767.207 12044674.424 24.4
>> PME 3D-FFT 1 32 20000002 39499.157 3286341.501 6.7
>> PME solve Elec 1 32 10000001 9947.340 827621.589 1.7
>> -----------------------------------------------------------------------------
>>
>> GPU timings
>> -----------------------------------------------------------------------------
>> Computing: Count Wall t (s) ms/step %
>> -----------------------------------------------------------------------------
>> Pair list H2D 250001 935.751 3.743 0.2
>> X / q H2D 10000001 11509.209 1.151 2.8
>> Nonbonded F+ene k. 9750000 377111.949 38.678 92.0
>> Nonbonded F+ene+prune k. 250001 12049.010 48.196 2.9
>> F D2H 10000001 8129.292 0.813 2.0
>> -----------------------------------------------------------------------------
>> Total 409735.211 40.974 100.0
>> -----------------------------------------------------------------------------
>>
>> Force evaluation time GPU/CPU: 40.974 ms/24.437 ms = 1.677
>> For optimal performance this ratio should be close to 1!
>>
>>
>> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>> performance loss, consider using a shorter cut-off and a finer PME
>> grid.
>>
>> Core t (s) Wall t (s) (%)
>> Time: 18713831.228 593302.894 3154.2
>> 6d20h48:22
>> (ns/day) (hour/ns)
>> Performance: 2.913 8.240
>> Finished mdrun on rank 0 Mon Dec 29 13:16:24 2014
>>
>>
>> -------------------------------------------------------
>> thank you in advance
>> Carmen
>>
>>
>>
>> --
>> Carmen Di Giovanni, PhD
>> Dept. of Pharmaceutical and Toxicological Chemistry
>> "Drug Discovery Lab"
>> University of Naples "Federico II"
>> Via D. Montesano, 49
>> 80131 Naples
>> Tel.: ++39 081 678623
>> Fax: ++39 081 678100
>> Email: cdigiova at unina.it
>>
>>
>>
>> Quoting Szilárd Páll <pall.szilard at gmail.com>:
>>
>>> We need a *full* log file, not parts of it!
>>>
>>> You can try running with "-ntomp 16 -pin on" - it may be a bit faster
>>> to not use HyperThreading.
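>>>
>>> With the posted run that would look something like this (binary and
>>> -deffnm taken from your log; 16 matches the number of physical cores):
>>>
>>>   gmx_mpi mdrun -deffnm prod_20ns -ntomp 16 -pin on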
>>> --
>>> Szilárd
>>>
>>>
>>> On Wed, Feb 18, 2015 at 5:20 PM, Carmen Di Giovanni <cdigiova at unina.it>
>>> wrote:
>>>>
>>>> Justin,
>>>> the problem is evident for all calculations.
>>>> This is the log file of a recent run:
>>>>
>>>>
>>>> --------------------------------------------------------------------------------
>>>>
>>>> Log file opened on Mon Dec 22 16:28:00 2014
>>>> Host: localhost.localdomain pid: 8378 rank ID: 0 number of ranks: 1
>>>> GROMACS: gmx mdrun, VERSION 5.0
>>>>
>>>> GROMACS is written by:
>>>> Emile Apol Rossen Apostolov Herman J.C. Berendsen Par Bjelkmar
>>>> Aldert van Buuren Rudi van Drunen Anton Feenstra Sebastian
>>>> Fritsch
>>>> Gerrit Groenhof Christoph Junghans Peter Kasson Carsten Kutzner
>>>> Per Larsson Justin A. Lemkul Magnus Lundborg Pieter
>>>> Meulenhoff
>>>> Erik Marklund Teemu Murtola Szilard Pall Sander Pronk
>>>> Roland Schulz Alexey Shvetsov Michael Shirts Alfons Sijbers
>>>> Peter Tieleman Christian Wennberg Maarten Wolf
>>>> and the project leaders:
>>>> Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel
>>>>
>>>> Copyright (c) 1991-2000, University of Groningen, The Netherlands.
>>>> Copyright (c) 2001-2014, The GROMACS development team at
>>>> Uppsala University, Stockholm University and
>>>> the Royal Institute of Technology, Sweden.
>>>> check out http://www.gromacs.org for more information.
>>>>
>>>> GROMACS is free software; you can redistribute it and/or modify it
>>>> under the terms of the GNU Lesser General Public License
>>>> as published by the Free Software Foundation; either version 2.1
>>>> of the License, or (at your option) any later version.
>>>>
>>>> GROMACS: gmx mdrun, VERSION 5.0
>>>> Executable: /opt/SW/gromacs-5.0/build/mpi-cuda/bin/gmx_mpi
>>>> Library dir: /opt/SW/gromacs-5.0/share/top
>>>> Command line:
>>>> gmx_mpi mdrun -deffnm prod_20ns
>>>>
>>>> Gromacs version: VERSION 5.0
>>>> Precision: single
>>>> Memory model: 64 bit
>>>> MPI library: MPI
>>>> OpenMP support: enabled
>>>> GPU support: enabled
>>>> invsqrt routine: gmx_software_invsqrt(x)
>>>> SIMD instructions: AVX_256
>>>> FFT library: fftw-3.3.3-sse2
>>>> RDTSCP usage: enabled
>>>> C++11 compilation: disabled
>>>> TNG support: enabled
>>>> Tracing support: disabled
>>>> Built on: Thu Jul 31 18:30:37 CEST 2014
>>>> Built by: root at localhost.localdomain [CMAKE]
>>>> Build OS/arch: Linux 2.6.32-431.el6.x86_64 x86_64
>>>> Build CPU vendor: GenuineIntel
>>>> Build CPU brand: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>>>> Build CPU family: 6 Model: 62 Stepping: 4
>>>> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx
>>>> msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2
>>>> sse3
>>>> sse4.1 sse4.2 ssse3 tdt x2apic
>>>> C compiler: /usr/bin/cc GNU 4.4.7
>>>> C compiler flags: -mavx -Wno-maybe-uninitialized -Wextra
>>>> -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
>>>> -Wno-unused -Wunused-value -Wunused-parameter -fomit-frame-pointer
>>>> -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
>>>> C++ compiler: /usr/bin/c++ GNU 4.4.7
>>>> C++ compiler flags: -mavx -Wextra -Wno-missing-field-initializers
>>>> -Wpointer-arith -Wall -Wno-unused-function -fomit-frame-pointer
>>>> -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
>>>> Boost version: 1.55.0 (internal)
>>>> CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda
>>>> compiler
>>>> driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on
>>>> Thu_Mar_13_11:58:58_PDT_2014;Cuda compilation tools, release 6.0, V6.0.1
>>>> CUDA compiler
>>>>
>>>> flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;-Xcompiler;-fPIC
>>>> ;
>>>>
>>>> ;-mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-fomit-frame-pointer;-funroll-all-loops;-Wno-array-bounds;-O3;-DNDEBUG
>>>> CUDA driver: 6.50
>>>> CUDA runtime: 6.0
>>>>
>>>>
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
>>>> GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
>>>> molecular simulation
>>>> J. Chem. Theory Comput. 4 (2008) pp. 435-447
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J.
>>>> C.
>>>> Berendsen
>>>> GROMACS: Fast, Flexible and Free
>>>> J. Comp. Chem. 26 (2005) pp. 1701-1719
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> E. Lindahl and B. Hess and D. van der Spoel
>>>> GROMACS 3.0: A package for molecular simulation and trajectory analysis
>>>> J. Mol. Mod. 7 (2001) pp. 306-317
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> H. J. C. Berendsen, D. van der Spoel and R. van Drunen
>>>> GROMACS: A message-passing parallel molecular dynamics implementation
>>>> Comp. Phys. Comm. 91 (1995) pp. 43-56
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>>
>>>> For optimal performance with a GPU nstlist (now 10) should be larger.
>>>> The optimum depends on your CPU and GPU resources.
>>>> You might want to try several nstlist values.
>>>> Changing nstlist from 10 to 40, rlist from 1.2 to 1.285
>>>>
>>>> Input Parameters:
>>>> integrator = md
>>>> tinit = 0
>>>> dt = 0.002
>>>> nsteps = 10000000
>>>> init-step = 0
>>>> simulation-part = 1
>>>> comm-mode = Linear
>>>> nstcomm = 1
>>>> bd-fric = 0
>>>> ld-seed = 1993
>>>> emtol = 10
>>>> emstep = 0.01
>>>> niter = 20
>>>> fcstep = 0
>>>> nstcgsteep = 1000
>>>> nbfgscorr = 10
>>>> rtpi = 0.05
>>>> nstxout = 2500
>>>> nstvout = 2500
>>>> nstfout = 0
>>>> nstlog = 2500
>>>> nstcalcenergy = 1
>>>> nstenergy = 2500
>>>> nstxout-compressed = 500
>>>> compressed-x-precision = 1000
>>>> cutoff-scheme = Verlet
>>>> nstlist = 40
>>>> ns-type = Grid
>>>> pbc = xyz
>>>> periodic-molecules = FALSE
>>>> verlet-buffer-tolerance = 0.005
>>>> rlist = 1.285
>>>> rlistlong = 1.285
>>>> nstcalclr = 10
>>>> coulombtype = PME
>>>> coulomb-modifier = Potential-shift
>>>> rcoulomb-switch = 0
>>>> rcoulomb = 1.2
>>>> epsilon-r = 1
>>>> epsilon-rf = 1
>>>> vdw-type = Cut-off
>>>> vdw-modifier = Potential-shift
>>>> rvdw-switch = 0
>>>> rvdw = 1.2
>>>> DispCorr = No
>>>> table-extension = 1
>>>> fourierspacing = 0.135
>>>> fourier-nx = 128
>>>> fourier-ny = 128
>>>> fourier-nz = 128
>>>> pme-order = 4
>>>> ewald-rtol = 1e-05
>>>> ewald-rtol-lj = 0.001
>>>> lj-pme-comb-rule = Geometric
>>>> ewald-geometry = 0
>>>> epsilon-surface = 0
>>>> implicit-solvent = No
>>>> gb-algorithm = Still
>>>> nstgbradii = 1
>>>> rgbradii = 2
>>>> gb-epsilon-solvent = 80
>>>> gb-saltconc = 0
>>>> gb-obc-alpha = 1
>>>> gb-obc-beta = 0.8
>>>> gb-obc-gamma = 4.85
>>>> gb-dielectric-offset = 0.009
>>>> sa-algorithm = Ace-approximation
>>>> sa-surface-tension = 2.092
>>>> tcoupl = V-rescale
>>>> nsttcouple = 10
>>>> nh-chain-length = 0
>>>> print-nose-hoover-chain-variables = FALSE
>>>> pcoupl = No
>>>> pcoupltype = Semiisotropic
>>>> nstpcouple = -1
>>>> tau-p = 0.5
>>>> compressibility (3x3):
>>>> compressibility[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> compressibility[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> ref-p (3x3):
>>>> ref-p[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> ref-p[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> refcoord-scaling = No
>>>> posres-com (3):
>>>> posres-com[0]= 0.00000e+00
>>>> posres-com[1]= 0.00000e+00
>>>> posres-com[2]= 0.00000e+00
>>>> posres-comB (3):
>>>> posres-comB[0]= 0.00000e+00
>>>> posres-comB[1]= 0.00000e+00
>>>> posres-comB[2]= 0.00000e+00
>>>> QMMM = FALSE
>>>> QMconstraints = 0
>>>> QMMMscheme = 0
>>>> MMChargeScaleFactor = 1
>>>> qm-opts:
>>>> ngQM = 0
>>>> constraint-algorithm = Lincs
>>>> continuation = FALSE
>>>> Shake-SOR = FALSE
>>>> shake-tol = 0.0001
>>>> lincs-order = 4
>>>> lincs-iter = 1
>>>> lincs-warnangle = 30
>>>> nwall = 0
>>>> wall-type = 9-3
>>>> wall-r-linpot = -1
>>>> wall-atomtype[0] = -1
>>>> wall-atomtype[1] = -1
>>>> wall-density[0] = 0
>>>> wall-density[1] = 0
>>>> wall-ewald-zfac = 3
>>>> pull = no
>>>> rotation = FALSE
>>>> interactiveMD = FALSE
>>>> disre = No
>>>> disre-weighting = Conservative
>>>> disre-mixed = FALSE
>>>> dr-fc = 1000
>>>> dr-tau = 0
>>>> nstdisreout = 100
>>>> orire-fc = 0
>>>> orire-tau = 0
>>>> nstorireout = 100
>>>> free-energy = no
>>>> cos-acceleration = 0
>>>> deform (3x3):
>>>> deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
>>>> simulated-tempering = FALSE
>>>> E-x:
>>>> n = 0
>>>> E-xt:
>>>> n = 0
>>>> E-y:
>>>> n = 0
>>>> E-yt:
>>>> n = 0
>>>> E-z:
>>>> n = 0
>>>> E-zt:
>>>> n = 0
>>>> swapcoords = no
>>>> adress = FALSE
>>>> userint1 = 0
>>>> userint2 = 0
>>>> userint3 = 0
>>>> userint4 = 0
>>>> userreal1 = 0
>>>> userreal2 = 0
>>>> userreal3 = 0
>>>> userreal4 = 0
>>>> grpopts:
>>>> nrdf: 869226
>>>> ref-t: 300
>>>> tau-t: 0.1
>>>> annealing: No
>>>> annealing-npoints: 0
>>>> acc: 0 0 0
>>>> nfreeze: N N N
>>>> energygrp-flags[ 0]: 0
>>>> Using 1 MPI process
>>>> Using 32 OpenMP threads
>>>>
>>>> Detecting CPU SIMD instructions.
>>>> Present hardware specification:
>>>> Vendor: GenuineIntel
>>>> Brand: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>>>> Family: 6 Model: 62 Stepping: 4
>>>> Features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr
>>>> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3
>>>> sse4.1 sse4.2 ssse3 tdt x2apic
>>>> SIMD instructions most likely to fit this hardware: AVX_256
>>>> SIMD instructions selected at GROMACS compile time: AVX_256
>>>>
>>>>
>>>> 2 GPUs detected on host localhost.localdomain:
>>>> #0: NVIDIA Tesla K20c, compute cap.: 3.5, ECC: yes, stat: compatible
>>>> #1: NVIDIA GeForce GTX 650, compute cap.: 3.0, ECC: no, stat:
>>>> compatible
>>>>
>>>> 1 GPU auto-selected for this run.
>>>> Mapping of GPU to the 1 PP rank in this node: #0
>>>>
>>>>
>>>> NOTE: potentially sub-optimal launch configuration, gmx_mpi started with
>>>> less PP MPI process per node than GPUs available.
>>>> Each PP MPI process can use only one GPU, 1 GPU per node will be used.
>>>>
>>>> Will do PME sum in reciprocal space for electrostatic interactions.
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G.
>>>> Pedersen
>>>> A smooth particle mesh Ewald method
>>>> J. Chem. Phys. 103 (1995) pp. 8577-8592
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>> Will do ordinary reciprocal space Ewald sum.
>>>> Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
>>>> Cut-off's: NS: 1.285 Coulomb: 1.2 LJ: 1.2
>>>> System total charge: -0.012
>>>> Generated table with 1142 data points for Ewald.
>>>> Tabscale = 500 points/nm
>>>> Generated table with 1142 data points for LJ6.
>>>> Tabscale = 500 points/nm
>>>> Generated table with 1142 data points for LJ12.
>>>> Tabscale = 500 points/nm
>>>> Generated table with 1142 data points for 1-4 COUL.
>>>> Tabscale = 500 points/nm
>>>> Generated table with 1142 data points for 1-4 LJ6.
>>>> Tabscale = 500 points/nm
>>>> Generated table with 1142 data points for 1-4 LJ12.
>>>> Tabscale = 500 points/nm
>>>>
>>>> Using CUDA 8x8 non-bonded kernels
>>>>
>>>> Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -1.000e-05
>>>> Initialized non-bonded Ewald correction tables, spacing: 7.82e-04 size:
>>>> 1536
>>>>
>>>> Removing pbc first time
>>>> Pinning threads with an auto-selected logical core stride of 1
>>>>
>>>> Initializing LINear Constraint Solver
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
>>>> LINCS: A Linear Constraint Solver for molecular simulations
>>>> J. Comp. Chem. 18 (1997) pp. 1463-1472
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>> The number of constraints is 5913
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> S. Miyamoto and P. A. Kollman
>>>> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for
>>>> Rigid
>>>> Water Models
>>>> J. Comp. Chem. 13 (1992) pp. 952-962
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>> Center of mass motion removal mode is Linear
>>>> We have the following groups for center of mass motion removal:
>>>> 0: rest
>>>>
>>>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>>>> G. Bussi, D. Donadio and M. Parrinello
>>>> Canonical sampling through velocity rescaling
>>>> J. Chem. Phys. 126 (2007) pp. 014101
>>>> -------- -------- --- Thank You --- -------- --------
>>>>
>>>> There are: 434658 Atoms
>>>>
>>>> Constraining the starting coordinates (step 0)
>>>>
>>>> Constraining the coordinates at t0-dt (step 0)
>>>> RMS relative constraint deviation after constraining: 3.67e-05
>>>> Initial temperature: 300.5 K
>>>>
>>>> Started mdrun on rank 0 Mon Dec 22 16:28:01 2014
>>>> Step Time Lambda
>>>> 0 0.00000 0.00000
>>>>
>>>> Energies (kJ/mol)
>>>>      G96Angle   Proper Dih.  Improper Dih.         LJ-14    Coulomb-14
>>>>   9.74139e+03   4.34956e+03    2.97359e+03  -1.93107e+02   8.05534e+04
>>>>       LJ (SR)  Coulomb (SR)   Coul. recip.     Potential   Kinetic En.
>>>>   1.01340e+06  -7.13271e+06    2.01361e+04  -6.00175e+06   1.09887e+06
>>>>  Total Energy  Conserved En.   Temperature  Pressure (bar)  Constr. rmsd
>>>>  -4.90288e+06  -4.90288e+06    3.04092e+02   1.70897e+02   2.16683e-05
>>>>
>>>> step  80: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6279.0 M-cycles
>>>> step 160: timed with pme grid 112 112 112, coulomb cutoff 1.306: 6962.2 M-cycles
>>>> step 240: timed with pme grid 100 100 100, coulomb cutoff 1.463: 8406.5 M-cycles
>>>> step 320: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6424.0 M-cycles
>>>> step 400: timed with pme grid 120 120 120, coulomb cutoff 1.219: 6369.1 M-cycles
>>>> step 480: timed with pme grid 112 112 112, coulomb cutoff 1.306: 7309.0 M-cycles
>>>> step 560: timed with pme grid 108 108 108, coulomb cutoff 1.355: 7521.2 M-cycles
>>>> step 640: timed with pme grid 104 104 104, coulomb cutoff 1.407: 8369.8 M-cycles
>>>> optimal pme grid 128 128 128, coulomb cutoff 1.200
>>>> Step Time Lambda
>>>> 2500 5.00000 0.00000
>>>>
>>>> Energies (kJ/mol)
>>>>      G96Angle   Proper Dih.  Improper Dih.         LJ-14    Coulomb-14
>>>>   9.72545e+03   4.33046e+03    2.98087e+03  -1.95794e+02   8.05967e+04
>>>>       LJ (SR)  Coulomb (SR)   Coul. recip.     Potential   Kinetic En.
>>>>   1.01293e+06  -7.13110e+06    2.01689e+04  -6.00057e+06   1.08489e+06
>>>>  Total Energy  Conserved En.   Temperature  Pressure (bar)  Constr. rmsd
>>>>  -4.91567e+06  -4.90300e+06    3.00225e+02   1.36173e+02   2.25998e-05
>>>>
>>>> Step Time Lambda
>>>> 5000 10.00000 0.00000
>>>>
>>>> ............
>>>>
>>>>
>>>> -------------------------------------------------------------------------------
>>>>
>>>>
>>>> Thank you in advance
>>>>
>>>> --
>>>> Carmen Di Giovanni, PhD
>>>> Dept. of Pharmaceutical and Toxicological Chemistry
>>>> "Drug Discovery Lab"
>>>> University of Naples "Federico II"
>>>> Via D. Montesano, 49
>>>> 80131 Naples
>>>> Tel.: ++39 081 678623
>>>> Fax: ++39 081 678100
>>>> Email: cdigiova at unina.it
>>>>
>>>>
>>>>
>>>> Quoting Justin Lemkul <jalemkul at vt.edu>:
>>>>
>>>>>
>>>>>
>>>>> On 2/18/15 11:09 AM, Barnett, James W wrote:
>>>>>>
>>>>>>
>>>>>> What's your exact command?
>>>>>>
>>>>>
>>>>> A full .log file would be even better; it would tell us everything we
>>>>> need
>>>>> to know :)
>>>>>
>>>>> -Justin
>>>>>
>>>>>> Have you reviewed this page:
>>>>>> http://www.gromacs.org/Documentation/Acceleration_and_parallelization
>>>>>>
>>>>>> James "Wes" Barnett
>>>>>> Ph.D. Candidate
>>>>>> Chemical and Biomolecular Engineering
>>>>>>
>>>>>> Tulane University
>>>>>> Boggs Center for Energy and Biotechnology, Room 341-B
>>>>>>
>>>>>> ________________________________________
>>>>>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se
>>>>>> <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Carmen
>>>>>> Di
>>>>>> Giovanni <cdigiova at unina.it>
>>>>>> Sent: Wednesday, February 18, 2015 10:06 AM
>>>>>> To: gromacs.org_gmx-users at maillist.sys.kth.se
>>>>>> Subject: Re: [gmx-users] GPU low performance
>>>>>>
>>>>>> I post the message of an MD run:
>>>>>>
>>>>>>
>>>>>> Force evaluation time GPU/CPU: 40.974 ms/24.437 ms = 1.677
>>>>>> For optimal performance this ratio should be close to 1!
>>>>>>
>>>>>>
>>>>>> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>>>>>> performance loss, consider using a shorter cut-off and a finer
>>>>>> PME
>>>>>> grid.
>>>>>>
>>>>>> How can I solve this problem?
>>>>>> Thank you in advance
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Carmen Di Giovanni, PhD
>>>>>> Dept. of Pharmaceutical and Toxicological Chemistry
>>>>>> "Drug Discovery Lab"
>>>>>> University of Naples "Federico II"
>>>>>> Via D. Montesano, 49
>>>>>> 80131 Naples
>>>>>> Tel.: ++39 081 678623
>>>>>> Fax: ++39 081 678100
>>>>>> Email: cdigiova at unina.it
>>>>>>
>>>>>>
>>>>>>
>>>>>> Quoting Justin Lemkul <jalemkul at vt.edu>:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2/18/15 10:30 AM, Carmen Di Giovanni wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>> I'm working on a machine with an NVIDIA Tesla K20.
>>>>>>>> After a minimization of a protein of 1925 atoms this is the message:
>>>>>>>>
>>>>>>>> Force evaluation time GPU/CPU: 2.923 ms/116.774 ms = 0.025
>>>>>>>> For optimal performance this ratio should be close to 1!
>>>>>>>>
>>>>>>>
>>>>>>> Minimization is a poor indicator of performance. Do a real MD run.
>>>>>>>
>>>>>>>>
>>>>>>>> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>>>>>>>> performance loss.
>>>>>>>>
>>>>>>>> Core t (s) Wall t (s) (%)
>>>>>>>> Time: 3289.010 205.891 1597.4
>>>>>>>> (steps/hour)
>>>>>>>> Performance: 8480.2
>>>>>>>> Finished mdrun on rank 0 Wed Feb 18 15:50:06 2015
>>>>>>>>
>>>>>>>>
>>>>>>>> Can I improve the performance?
>>>>>>>> So far I haven't found full information in the forum to solve this
>>>>>>>> problem.
>>>>>>>> The log file is attached.
>>>>>>>>
>>>>>>>
>>>>>>> The list does not accept attachments. If you wish to share a file,
>>>>>>> upload it to a file-sharing service and provide a URL. The full
>>>>>>> .log is quite important for understanding your hardware,
>>>>>>> optimizations, and seeing full details of the performance breakdown.
>>>>>>> But again, base your assessment on MD, not EM.
>>>>>>>
>>>>>>> -Justin
>>>>>>>
>>>>>>> --
>>>>>>> ==================================================
>>>>>>>
>>>>>>> Justin A. Lemkul, Ph.D.
>>>>>>> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>>>>>>>
>>>>>>> Department of Pharmaceutical Sciences
>>>>>>> School of Pharmacy
>>>>>>> Health Sciences Facility II, Room 629
>>>>>>> University of Maryland, Baltimore
>>>>>>> 20 Penn St.
>>>>>>> Baltimore, MD 21201
>>>>>>>
>>>>>>> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
>>>>>>> http://mackerell.umaryland.edu/~jalemkul
>>>>>>>
>>>>>>> ==================================================
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> ==================================================
>>>>>
>>>>> Justin A. Lemkul, Ph.D.
>>>>> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>>>>>
>>>>> Department of Pharmaceutical Sciences
>>>>> School of Pharmacy
>>>>> Health Sciences Facility II, Room 629
>>>>> University of Maryland, Baltimore
>>>>> 20 Penn St.
>>>>> Baltimore, MD 21201
>>>>>
>>>>> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
>>>>> http://mackerell.umaryland.edu/~jalemkul
>>>>>
>>>>> ==================================================
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>