[gmx-users] GPU low performance

Carmen Di Giovanni cdigiova at unina.it
Wed Feb 18 17:57:33 CET 2015


Dear all, the full log file is too big to post. Its middle part contains
only the energy output at each reporting step, and the first part is
already posted, so here is the final part of it:
-------------------------------------------------------------
            Step           Time         Lambda
        10000000    20000.00000        0.00000

Writing checkpoint, step 10000000 at Mon Dec 29 13:16:22 2014


    Energies (kJ/mol)
        G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
     9.34206e+03    4.14342e+03    2.79172e+03   -1.75465e+02    7.99811e+04
         LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
     1.01135e+06   -7.13064e+06    2.01349e+04   -6.00306e+06    1.08201e+06
    Total Energy  Conserved En.    Temperature Pressure (bar)   Constr. rmsd
    -4.92106e+06   -5.86747e+06    2.99426e+02    1.29480e+02    2.16280e-05

	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 10000001 steps using 10000001 frames

    Energies (kJ/mol)
        G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
     9.45818e+03    4.30665e+03    2.92407e+03   -1.75556e+02    8.02473e+04
         LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
     1.01284e+06   -7.13138e+06    2.01510e+04   -6.00163e+06    1.08407e+06
    Total Energy  Conserved En.    Temperature Pressure (bar)   Constr. rmsd
    -4.91756e+06   -5.38519e+06    2.99998e+02    1.37549e+02    0.00000e+00

    Total Virial (kJ/mol)
     3.42887e+05    1.63625e+01    1.23658e+02
     1.67406e+01    3.42916e+05   -4.27834e+01
     1.23997e+02   -4.29636e+01    3.42881e+05

    Pressure (bar)
     1.37573e+02    7.50214e-02   -1.03916e-01
     7.22048e-02    1.37623e+02   -1.66417e-02
    -1.06444e-01   -1.52990e-02    1.37453e+02


	M E G A - F L O P S   A C C O U N T I N G

  NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
  V&F=Potential and force  V=Potential only  F=Force only

  Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
  Pair Search distance check        16343508.605344   147091577.448     0.0
  NxN Ewald Elec. + LJ [V&F]       5072118956.506304 542716728346.174    98.1
  1,4 nonbonded interactions           95860.009586     8627400.863     0.0
  Calc Weights                      13039741.303974   469430686.943     0.1
  Spread Q Bspline                 278181147.818112   556362295.636     0.1
  Gather F Bspline                 278181147.818112  1669086886.909     0.3
  3D-FFT                           880787450.909824  7046299607.279     1.3
  Solve PME                           163837.909504    10485626.208     0.0
  Shift-X                             108664.934658      651989.608     0.0
  Angles                               86090.008609    14463121.446     0.0
  Propers                              31380.003138     7186020.719     0.0
  Impropers                            28790.002879     5988320.599     0.0
  Virial                             4347030.434703    78246547.825     0.0
  Stop-CM                            4346580.869316    43465808.693     0.0
  Calc-Ekin                          4346580.869316   117357683.472     0.0
  Lincs                                59130.017739     3547801.064     0.0
  Lincs-Mat                          1033080.309924     4132321.240     0.0
  Constraint-V                       4406580.881316    35252647.051     0.0
  Constraint-Vir                     4347450.434745   104338810.434     0.0
  Settle                             1429440.428832   461709258.513     0.1
-----------------------------------------------------------------------------
  Total                                             553500452758.122   100.0
-----------------------------------------------------------------------------


      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 32 OpenMP threads

  Computing:          Num   Num      Call    Wall time         Giga-Cycles
                      Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
  Neighbor search        1   32     250001    6231.657     518475.694   1.1
  Launch GPU ops.        1   32   10000001    1825.689     151897.833   0.3
  Force                  1   32   10000001   49568.959    4124152.027   8.4
  PME mesh               1   32   10000001  194798.850   16207321.863  32.8
  Wait GPU local         1   32   10000001  170272.438   14166717.115  28.7
  NB X/F buffer ops.     1   32   19750001   29175.632    2427421.177   4.9
  Write traj.            1   32      20635    1567.928     130452.056   0.3
  Update                 1   32   10000001   13312.819    1107630.452   2.2
  Constraints            1   32   10000001   34210.142    2846293.908   5.8
  Rest                                       92338.781    7682613.897  15.6
-----------------------------------------------------------------------------
  Total                                     593302.894   49362976.023 100.0
-----------------------------------------------------------------------------
  Breakdown of PME mesh computation
-----------------------------------------------------------------------------
  PME spread/gather      1   32   20000002  144767.207   12044674.424  24.4
  PME 3D-FFT             1   32   20000002   39499.157    3286341.501   6.7
  PME solve Elec         1   32   10000001    9947.340     827621.589   1.7
-----------------------------------------------------------------------------

  GPU timings
-----------------------------------------------------------------------------
  Computing:                         Count  Wall t (s)      ms/step       %
-----------------------------------------------------------------------------
  Pair list H2D                     250001     935.751        3.743     0.2
  X / q H2D                       10000001   11509.209        1.151     2.8
  Nonbonded F+ene k.               9750000  377111.949       38.678    92.0
  Nonbonded F+ene+prune k.          250001   12049.010       48.196     2.9
  F D2H                           10000001    8129.292        0.813     2.0
-----------------------------------------------------------------------------
  Total                                     409735.211       40.974   100.0
-----------------------------------------------------------------------------

Force evaluation time GPU/CPU: 40.974 ms/24.437 ms = 1.677
For optimal performance this ratio should be close to 1!


NOTE: The GPU has >20% more load than the CPU. This imbalance causes
       performance loss, consider using a shorter cut-off and a finer PME grid.

                Core t (s)   Wall t (s)        (%)
        Time: 18713831.228   593302.894     3154.2
                          6d20h48:22
                  (ns/day)    (hour/ns)
Performance:        2.913        8.240
Finished mdrun on rank 0 Mon Dec 29 13:16:24 2014
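
Following the NOTE above, would a change along these lines in the .mdp be
the right direction, assuming the force field tolerates a shorter cut-off?
The values below are only a guess to illustrate "shorter cut-off, finer
PME grid"; I have not tested them:

   rcoulomb       = 1.0     ; shorter than the current 1.2
   rvdw           = 1.0     ; kept equal to rcoulomb
   fourierspacing = 0.12    ; finer grid than the current 0.135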


-------------------------------------------------------
Thank you in advance
Carmen



-- 
Carmen Di Giovanni, PhD
Dept. of Pharmaceutical and Toxicological Chemistry
"Drug Discovery Lab"
University of Naples "Federico II"
Via D. Montesano, 49
80131 Naples
Tel.: ++39 081 678623
Fax: ++39 081 678100
Email: cdigiova at unina.it



Quoting Szilárd Páll <pall.szilard at gmail.com>:

> We need a *full* log file, not parts of it!
>
> You can try running with "-ntomp 16 -pin on" - it may be a bit faster
> to not use HyperThreading.
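>
> For example, with the command line from your log that would be:
>
>   gmx_mpi mdrun -deffnm prod_20ns -ntomp 16 -pin on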
> --
> Szilárd
>
>
> On Wed, Feb 18, 2015 at 5:20 PM, Carmen Di Giovanni  
> <cdigiova at unina.it> wrote:
>> Justin,
>> the problem is evident for all calculations.
>> This is the log file of a recent run:
>>
>> --------------------------------------------------------------------------------
>>
>> Log file opened on Mon Dec 22 16:28:00 2014
>> Host: localhost.localdomain  pid: 8378  rank ID: 0  number of ranks:  1
>> GROMACS:    gmx mdrun, VERSION 5.0
>>
>> GROMACS is written by:
>> Emile Apol         Rossen Apostolov   Herman J.C. Berendsen Par Bjelkmar
>> Aldert van Buuren  Rudi van Drunen    Anton Feenstra     Sebastian Fritsch
>> Gerrit Groenhof    Christoph Junghans Peter Kasson       Carsten Kutzner
>> Per Larsson        Justin A. Lemkul   Magnus Lundborg    Pieter Meulenhoff
>> Erik Marklund      Teemu Murtola      Szilard Pall       Sander Pronk
>> Roland Schulz      Alexey Shvetsov    Michael Shirts     Alfons Sijbers
>> Peter Tieleman     Christian Wennberg Maarten Wolf
>> and the project leaders:
>> Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel
>>
>> Copyright (c) 1991-2000, University of Groningen, The Netherlands.
>> Copyright (c) 2001-2014, The GROMACS development team at
>> Uppsala University, Stockholm University and
>> the Royal Institute of Technology, Sweden.
>> check out http://www.gromacs.org for more information.
>>
>> GROMACS is free software; you can redistribute it and/or modify it
>> under the terms of the GNU Lesser General Public License
>> as published by the Free Software Foundation; either version 2.1
>> of the License, or (at your option) any later version.
>>
>> GROMACS:      gmx mdrun, VERSION 5.0
>> Executable:   /opt/SW/gromacs-5.0/build/mpi-cuda/bin/gmx_mpi
>> Library dir:  /opt/SW/gromacs-5.0/share/top
>> Command line:
>>   gmx_mpi mdrun -deffnm prod_20ns
>>
>> Gromacs version:    VERSION 5.0
>> Precision:          single
>> Memory model:       64 bit
>> MPI library:        MPI
>> OpenMP support:     enabled
>> GPU support:        enabled
>> invsqrt routine:    gmx_software_invsqrt(x)
>> SIMD instructions:  AVX_256
>> FFT library:        fftw-3.3.3-sse2
>> RDTSCP usage:       enabled
>> C++11 compilation:  disabled
>> TNG support:        enabled
>> Tracing support:    disabled
>> Built on:           Thu Jul 31 18:30:37 CEST 2014
>> Built by:           root at localhost.localdomain [CMAKE]
>> Build OS/arch:      Linux 2.6.32-431.el6.x86_64 x86_64
>> Build CPU vendor:   GenuineIntel
>> Build CPU brand:    Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>> Build CPU family:   6   Model: 62   Stepping: 4
>> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx
>> msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3
>> sse4.1 sse4.2 ssse3 tdt x2apic
>> C compiler:         /usr/bin/cc GNU 4.4.7
>> C compiler flags:    -mavx   -Wno-maybe-uninitialized -Wextra
>> -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
>> -Wno-unused -Wunused-value -Wunused-parameter   -fomit-frame-pointer
>> -funroll-all-loops  -Wno-array-bounds  -O3 -DNDEBUG
>> C++ compiler:       /usr/bin/c++ GNU 4.4.7
>> C++ compiler flags:  -mavx   -Wextra -Wno-missing-field-initializers
>> -Wpointer-arith -Wall -Wno-unused-function   -fomit-frame-pointer
>> -funroll-all-loops  -Wno-array-bounds  -O3 -DNDEBUG
>> Boost version:      1.55.0 (internal)
>> CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
>> driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on
>> Thu_Mar_13_11:58:58_PDT_2014;Cuda compilation tools, release 6.0, V6.0.1
>> CUDA compiler
>> flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;-Xcompiler;-fPIC
>> ;
>> ;-mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-fomit-frame-pointer;-funroll-all-loops;-Wno-array-bounds;-O3;-DNDEBUG
>> CUDA driver:        6.50
>> CUDA runtime:       6.0
>>
>>
>>
>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>> B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
>> GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
>> molecular simulation
>> J. Chem. Theory Comput. 4 (2008) pp. 435-447
>> -------- -------- --- Thank You --- -------- --------
>>
>>
>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>> D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
>> Berendsen
>> GROMACS: Fast, Flexible and Free
>> J. Comp. Chem. 26 (2005) pp. 1701-1719
>> -------- -------- --- Thank You --- -------- --------
>>
>>
>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>> E. Lindahl and B. Hess and D. van der Spoel
>> GROMACS 3.0: A package for molecular simulation and trajectory analysis
>> J. Mol. Mod. 7 (2001) pp. 306-317
>> -------- -------- --- Thank You --- -------- --------
>>
>>
>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>> H. J. C. Berendsen, D. van der Spoel and R. van Drunen
>> GROMACS: A message-passing parallel molecular dynamics implementation
>> Comp. Phys. Comm. 91 (1995) pp. 43-56
>> -------- -------- --- Thank You --- -------- --------
>>
>>
>> For optimal performance with a GPU nstlist (now 10) should be larger.
>> The optimum depends on your CPU and GPU resources.
>> You might want to try several nstlist values.
>> Changing nstlist from 10 to 40, rlist from 1.2 to 1.285
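>>
>> (I understand several nstlist values can be tried without editing the
>> .mdp by overriding it at run time, e.g.
>> "gmx_mpi mdrun -deffnm prod_20ns -nstlist 20", where 20 is only an
>> example value to compare against 40.)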
>>
>> Input Parameters:
>>    integrator                     = md
>>    tinit                          = 0
>>    dt                             = 0.002
>>    nsteps                         = 10000000
>>    init-step                      = 0
>>    simulation-part                = 1
>>    comm-mode                      = Linear
>>    nstcomm                        = 1
>>    bd-fric                        = 0
>>    ld-seed                        = 1993
>>    emtol                          = 10
>>    emstep                         = 0.01
>>    niter                          = 20
>>    fcstep                         = 0
>>    nstcgsteep                     = 1000
>>    nbfgscorr                      = 10
>>    rtpi                           = 0.05
>>    nstxout                        = 2500
>>    nstvout                        = 2500
>>    nstfout                        = 0
>>    nstlog                         = 2500
>>    nstcalcenergy                  = 1
>>    nstenergy                      = 2500
>>    nstxout-compressed             = 500
>>    compressed-x-precision         = 1000
>>    cutoff-scheme                  = Verlet
>>    nstlist                        = 40
>>    ns-type                        = Grid
>>    pbc                            = xyz
>>    periodic-molecules             = FALSE
>>    verlet-buffer-tolerance        = 0.005
>>    rlist                          = 1.285
>>    rlistlong                      = 1.285
>>    nstcalclr                      = 10
>>    coulombtype                    = PME
>>    coulomb-modifier               = Potential-shift
>>    rcoulomb-switch                = 0
>>    rcoulomb                       = 1.2
>>    epsilon-r                      = 1
>>    epsilon-rf                     = 1
>>    vdw-type                       = Cut-off
>>    vdw-modifier                   = Potential-shift
>>    rvdw-switch                    = 0
>>    rvdw                           = 1.2
>>    DispCorr                       = No
>>    table-extension                = 1
>>    fourierspacing                 = 0.135
>>    fourier-nx                     = 128
>>    fourier-ny                     = 128
>>    fourier-nz                     = 128
>>    pme-order                      = 4
>>    ewald-rtol                     = 1e-05
>>    ewald-rtol-lj                  = 0.001
>>    lj-pme-comb-rule               = Geometric
>>    ewald-geometry                 = 0
>>    epsilon-surface                = 0
>>    implicit-solvent               = No
>>    gb-algorithm                   = Still
>>    nstgbradii                     = 1
>>    rgbradii                       = 2
>>    gb-epsilon-solvent             = 80
>>    gb-saltconc                    = 0
>>    gb-obc-alpha                   = 1
>>    gb-obc-beta                    = 0.8
>>    gb-obc-gamma                   = 4.85
>>    gb-dielectric-offset           = 0.009
>>    sa-algorithm                   = Ace-approximation
>>    sa-surface-tension             = 2.092
>>    tcoupl                         = V-rescale
>>    nsttcouple                     = 10
>>    nh-chain-length                = 0
>>    print-nose-hoover-chain-variables = FALSE
>>    pcoupl                         = No
>>    pcoupltype                     = Semiisotropic
>>    nstpcouple                     = -1
>>    tau-p                          = 0.5
>>    compressibility (3x3):
>>       compressibility[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>>       compressibility[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>>       compressibility[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>>    ref-p (3x3):
>>       ref-p[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>>       ref-p[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>>       ref-p[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>>    refcoord-scaling               = No
>>    posres-com (3):
>>       posres-com[0]= 0.00000e+00
>>       posres-com[1]= 0.00000e+00
>>       posres-com[2]= 0.00000e+00
>>    posres-comB (3):
>>       posres-comB[0]= 0.00000e+00
>>       posres-comB[1]= 0.00000e+00
>>       posres-comB[2]= 0.00000e+00
>>    QMMM                           = FALSE
>>    QMconstraints                  = 0
>>    QMMMscheme                     = 0
>>    MMChargeScaleFactor            = 1
>> qm-opts:
>>    ngQM                           = 0
>>    constraint-algorithm           = Lincs
>>    continuation                   = FALSE
>>    Shake-SOR                      = FALSE
>>    shake-tol                      = 0.0001
>>    lincs-order                    = 4
>>    lincs-iter                     = 1
>>    lincs-warnangle                = 30
>>    nwall                          = 0
>>    wall-type                      = 9-3
>>    wall-r-linpot                  = -1
>>    wall-atomtype[0]               = -1
>>    wall-atomtype[1]               = -1
>>    wall-density[0]                = 0
>>    wall-density[1]                = 0
>>    wall-ewald-zfac                = 3
>>    pull                           = no
>>    rotation                       = FALSE
>>    interactiveMD                  = FALSE
>>    disre                          = No
>>    disre-weighting                = Conservative
>>    disre-mixed                    = FALSE
>>    dr-fc                          = 1000
>>    dr-tau                         = 0
>>    nstdisreout                    = 100
>>    orire-fc                       = 0
>>    orire-tau                      = 0
>>    nstorireout                    = 100
>>    free-energy                    = no
>>    cos-acceleration               = 0
>>    deform (3x3):
>>       deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>>       deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>>       deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>>    simulated-tempering            = FALSE
>>    E-x:
>>       n = 0
>>    E-xt:
>>       n = 0
>>    E-y:
>>       n = 0
>>    E-yt:
>>       n = 0
>>    E-z:
>>       n = 0
>>    E-zt:
>>       n = 0
>>    swapcoords                     = no
>>    adress                         = FALSE
>>    userint1                       = 0
>>    userint2                       = 0
>>    userint3                       = 0
>>    userint4                       = 0
>>    userreal1                      = 0
>>    userreal2                      = 0
>>    userreal3                      = 0
>>    userreal4                      = 0
>> grpopts:
>>    nrdf:      869226
>>    ref-t:         300
>>    tau-t:         0.1
>> annealing:          No
>> annealing-npoints:           0
>>    acc:            0           0           0
>>    nfreeze:           N           N           N
>>    energygrp-flags[  0]: 0
>> Using 1 MPI process
>> Using 32 OpenMP threads
>>
>> Detecting CPU SIMD instructions.
>> Present hardware specification:
>> Vendor: GenuineIntel
>> Brand:  Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>> Family:  6  Model: 62  Stepping:  4
>> Features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr
>> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3
>> sse4.1 sse4.2 ssse3 tdt x2apic
>> SIMD instructions most likely to fit this hardware: AVX_256
>> SIMD instructions selected at GROMACS compile time: AVX_256
>>
>>
>> 2 GPUs detected on host localhost.localdomain:
>>   #0: NVIDIA Tesla K20c, compute cap.: 3.5, ECC: yes, stat: compatible
>>   #1: NVIDIA GeForce GTX 650, compute cap.: 3.0, ECC:  no, stat: compatible
>>
>> 1 GPU auto-selected for this run.
>> Mapping of GPU to the 1 PP rank in this node: #0
>>
>>
>> NOTE: potentially sub-optimal launch configuration, gmx_mpi started with
>>       less PP MPI process per node than GPUs available.
>>       Each PP MPI process can use only one GPU, 1 GPU per node will be used.
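>>
>> (If both GPUs should be used, I understand mdrun needs one PP rank per
>> GPU, e.g. "mpirun -np 2 gmx_mpi mdrun -deffnm prod_20ns -gpu_id 01",
>> though pairing the K20c with the much slower GTX 650 could itself
>> cause load imbalance.)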
>>
>> Will do PME sum in reciprocal space for electrostatic interactions.
>>
>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>> U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
>> A smooth particle mesh Ewald method
>> J. Chem. Phys. 103 (1995) pp. 8577-8592
>> -------- -------- --- Thank You --- -------- --------
>>
>> Will do ordinary reciprocal space Ewald sum.
>> Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
>> Cut-off's:   NS: 1.285   Coulomb: 1.2   LJ: 1.2
>> System total charge: -0.012
>> Generated table with 1142 data points for Ewald.
>> Tabscale = 500 points/nm
>> Generated table with 1142 data points for LJ6.
>> Tabscale = 500 points/nm
>> Generated table with 1142 data points for LJ12.
>> Tabscale = 500 points/nm
>> Generated table with 1142 data points for 1-4 COUL.
>> Tabscale = 500 points/nm
>> Generated table with 1142 data points for 1-4 LJ6.
>> Tabscale = 500 points/nm
>> Generated table with 1142 data points for 1-4 LJ12.
>> Tabscale = 500 points/nm
>>
>> Using CUDA 8x8 non-bonded kernels
>>
>> Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -1.000e-05
>> Initialized non-bonded Ewald correction tables, spacing: 7.82e-04 size: 1536
>>
>> Removing pbc first time
>> Pinning threads with an auto-selected logical core stride of 1
>>
>> Initializing LINear Constraint Solver
>>
>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>> B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
>> LINCS: A Linear Constraint Solver for molecular simulations
>> J. Comp. Chem. 18 (1997) pp. 1463-1472
>> -------- -------- --- Thank You --- -------- --------
>>
>> The number of constraints is 5913
>>
>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>> S. Miyamoto and P. A. Kollman
>> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
>> Water Models
>> J. Comp. Chem. 13 (1992) pp. 952-962
>> -------- -------- --- Thank You --- -------- --------
>>
>> Center of mass motion removal mode is Linear
>> We have the following groups for center of mass motion removal:
>>   0:  rest
>>
>> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
>> G. Bussi, D. Donadio and M. Parrinello
>> Canonical sampling through velocity rescaling
>> J. Chem. Phys. 126 (2007) pp. 014101
>> -------- -------- --- Thank You --- -------- --------
>>
>> There are: 434658 Atoms
>>
>> Constraining the starting coordinates (step 0)
>>
>> Constraining the coordinates at t0-dt (step 0)
>> RMS relative constraint deviation after constraining: 3.67e-05
>> Initial temperature: 300.5 K
>>
>> Started mdrun on rank 0 Mon Dec 22 16:28:01 2014
>>            Step           Time         Lambda
>>               0        0.00000        0.00000
>>
>>    Energies (kJ/mol)
>>        G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
>>     9.74139e+03    4.34956e+03    2.97359e+03   -1.93107e+02    8.05534e+04
>>         LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
>>     1.01340e+06   -7.13271e+06    2.01361e+04   -6.00175e+06    1.09887e+06
>>    Total Energy  Conserved En.    Temperature Pressure (bar)   Constr. rmsd
>>    -4.90288e+06   -4.90288e+06    3.04092e+02    1.70897e+02    2.16683e-05
>>
>> step   80: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6279.0 M-cycles
>> step  160: timed with pme grid 112 112 112, coulomb cutoff 1.306: 6962.2 M-cycles
>> step  240: timed with pme grid 100 100 100, coulomb cutoff 1.463: 8406.5 M-cycles
>> step  320: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6424.0 M-cycles
>> step  400: timed with pme grid 120 120 120, coulomb cutoff 1.219: 6369.1 M-cycles
>> step  480: timed with pme grid 112 112 112, coulomb cutoff 1.306: 7309.0 M-cycles
>> step  560: timed with pme grid 108 108 108, coulomb cutoff 1.355: 7521.2 M-cycles
>> step  640: timed with pme grid 104 104 104, coulomb cutoff 1.407: 8369.8 M-cycles
>>               optimal pme grid 128 128 128, coulomb cutoff 1.200
>>            Step           Time         Lambda
>>            2500        5.00000        0.00000
>>
>>    Energies (kJ/mol)
>>        G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
>>     9.72545e+03    4.33046e+03    2.98087e+03   -1.95794e+02    8.05967e+04
>>         LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
>>     1.01293e+06   -7.13110e+06    2.01689e+04   -6.00057e+06    1.08489e+06
>>    Total Energy  Conserved En.    Temperature Pressure (bar)   Constr. rmsd
>>    -4.91567e+06   -4.90300e+06    3.00225e+02    1.36173e+02    2.25998e-05
>>
>>            Step           Time         Lambda
>>            5000       10.00000        0.00000
>>
>> ............
>>
>> -------------------------------------------------------------------------------
>>
>>
>> Thank you in advance
>>
>> --
>> Carmen Di Giovanni, PhD
>> Dept. of Pharmaceutical and Toxicological Chemistry
>> "Drug Discovery Lab"
>> University of Naples "Federico II"
>> Via D. Montesano, 49
>> 80131 Naples
>> Tel.: ++39 081 678623
>> Fax: ++39 081 678100
>> Email: cdigiova at unina.it
>>
>>
>>
>> Quoting Justin Lemkul <jalemkul at vt.edu>:
>>
>>>
>>>
>>> On 2/18/15 11:09 AM, Barnett, James W wrote:
>>>>
>>>> What's your exact command?
>>>>
>>>
>>> A full .log file would be even better; it would tell us everything we need
>>> to know :)
>>>
>>> -Justin
>>>
>>>> Have you reviewed this page:
>>>> http://www.gromacs.org/Documentation/Acceleration_and_parallelization
>>>>
>>>> James "Wes" Barnett
>>>> Ph.D. Candidate
>>>> Chemical and Biomolecular Engineering
>>>>
>>>> Tulane University
>>>> Boggs Center for Energy and Biotechnology, Room 341-B
>>>>
>>>> ________________________________________
>>>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se
>>>> <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Carmen Di
>>>> Giovanni <cdigiova at unina.it>
>>>> Sent: Wednesday, February 18, 2015 10:06 AM
>>>> To: gromacs.org_gmx-users at maillist.sys.kth.se
>>>> Subject: Re: [gmx-users] GPU low performance
>>>>
>>>> I post the message from an MD run:
>>>>
>>>>
>>>> Force evaluation time GPU/CPU: 40.974 ms/24.437 ms = 1.677
>>>> For optimal performance this ratio should be close to 1!
>>>>
>>>>
>>>> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>>>>        performance loss, consider using a shorter cut-off and a finer PME
>>>> grid.
>>>>
>>>> How can I solve this problem?
>>>> Thank you in advance
>>>>
>>>>
>>>> --
>>>> Carmen Di Giovanni, PhD
>>>> Dept. of Pharmaceutical and Toxicological Chemistry
>>>> "Drug Discovery Lab"
>>>> University of Naples "Federico II"
>>>> Via D. Montesano, 49
>>>> 80131 Naples
>>>> Tel.: ++39 081 678623
>>>> Fax: ++39 081 678100
>>>> Email: cdigiova at unina.it
>>>>
>>>>
>>>>
>>>> Quoting Justin Lemkul <jalemkul at vt.edu>:
>>>>
>>>>>
>>>>>
>>>>> On 2/18/15 10:30 AM, Carmen Di Giovanni wrote:
>>>>>>
>>>>>> Dear all,
>>>>>> I'm working on a machine with an NVIDIA Tesla K20.
>>>>>> After a minimization on a protein of 1925 atoms this is the message:
>>>>>>
>>>>>> Force evaluation time GPU/CPU: 2.923 ms/116.774 ms = 0.025
>>>>>> For optimal performance this ratio should be close to 1!
>>>>>>
>>>>>
>>>>> Minimization is a poor indicator of performance.  Do a real MD run.
>>>>>
>>>>>>
>>>>>> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>>>>>> performance loss.
>>>>>>
>>>>>>                Core t (s)   Wall t (s)        (%)
>>>>>>        Time:     3289.010      205.891     1597.4
>>>>>>                  (steps/hour)
>>>>>> Performance:       8480.2
>>>>>> Finished mdrun on rank 0 Wed Feb 18 15:50:06 2015
>>>>>>
>>>>>>
>>>>>> Can I improve the performance?
>>>>>> At the moment I haven't found enough information in the forum to
>>>>>> solve this problem.
>>>>>> The .log file is attached.
>>>>>>
>>>>>
>>>>> The list does not accept attachments.  If you wish to share a file,
>>>>> upload it to a file-sharing service and provide a URL.  The full
>>>>> .log is quite important for understanding your hardware,
>>>>> optimizations, and seeing full details of the performance breakdown.
>>>>>  But again, base your assessment on MD, not EM.
>>>>>
>>>>> -Justin
>>>>>
>>>>> --
>>>>> ==================================================
>>>>>
>>>>> Justin A. Lemkul, Ph.D.
>>>>> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>>>>>
>>>>> Department of Pharmaceutical Sciences
>>>>> School of Pharmacy
>>>>> Health Sciences Facility II, Room 629
>>>>> University of Maryland, Baltimore
>>>>> 20 Penn St.
>>>>> Baltimore, MD 21201
>>>>>
>>>>> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
>>>>> http://mackerell.umaryland.edu/~jalemkul
>>>>>
>>>>> ==================================================
>>>>
>>>
>>> --
>>> ==================================================
>>>
>>> Justin A. Lemkul, Ph.D.
>>> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>>>
>>> Department of Pharmaceutical Sciences
>>> School of Pharmacy
>>> Health Sciences Facility II, Room 629
>>> University of Maryland, Baltimore
>>> 20 Penn St.
>>> Baltimore, MD 21201
>>>
>>> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
>>> http://mackerell.umaryland.edu/~jalemkul
>>>
>>> ==================================================