[gmx-users] GPU low performance

Carmen Di Giovanni cdigiova at unina.it
Wed Feb 18 17:20:52 CET 2015


Justin,
the problem is evident for all calculations.
Here is the log file of a recent run:

--------------------------------------------------------------------------------

Log file opened on Mon Dec 22 16:28:00 2014
Host: localhost.localdomain  pid: 8378  rank ID: 0  number of ranks:  1
GROMACS:    gmx mdrun, VERSION 5.0

GROMACS is written by:
Emile Apol         Rossen Apostolov   Herman J.C. Berendsen Par Bjelkmar
Aldert van Buuren  Rudi van Drunen    Anton Feenstra     Sebastian Fritsch
Gerrit Groenhof    Christoph Junghans Peter Kasson       Carsten Kutzner
Per Larsson        Justin A. Lemkul   Magnus Lundborg    Pieter Meulenhoff
Erik Marklund      Teemu Murtola      Szilard Pall       Sander Pronk
Roland Schulz      Alexey Shvetsov    Michael Shirts     Alfons Sijbers
Peter Tieleman     Christian Wennberg Maarten Wolf
and the project leaders:
Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2014, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, VERSION 5.0
Executable:   /opt/SW/gromacs-5.0/build/mpi-cuda/bin/gmx_mpi
Library dir:  /opt/SW/gromacs-5.0/share/top
Command line:
   gmx_mpi mdrun -deffnm prod_20ns

Gromacs version:    VERSION 5.0
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled
GPU support:        enabled
invsqrt routine:    gmx_software_invsqrt(x)
SIMD instructions:  AVX_256
FFT library:        fftw-3.3.3-sse2
RDTSCP usage:       enabled
C++11 compilation:  disabled
TNG support:        enabled
Tracing support:    disabled
Built on:           Thu Jul 31 18:30:37 CEST 2014
Built by:           root at localhost.localdomain [CMAKE]
Build OS/arch:      Linux 2.6.32-431.el6.x86_64 x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Build CPU family:   6   Model: 62   Stepping: 4
Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /usr/bin/cc GNU 4.4.7
C compiler flags:   -mavx -Wno-maybe-uninitialized -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter -fomit-frame-pointer -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
C++ compiler:       /usr/bin/c++ GNU 4.4.7
C++ compiler flags: -mavx -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function -fomit-frame-pointer -funroll-all-loops -Wno-array-bounds -O3 -DNDEBUG
Boost version:      1.55.0 (internal)
CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver; Copyright (c) 2005-2013 NVIDIA Corporation; Built on Thu_Mar_13_11:58:58_PDT_2014; Cuda compilation tools, release 6.0, V6.0.1
CUDA compiler flags: -gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;-Xcompiler;-fPIC;-mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-fomit-frame-pointer;-funroll-all-loops;-Wno-array-bounds;-O3;-DNDEBUG
CUDA driver:        6.50
CUDA runtime:       6.0



++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------


For optimal performance with a GPU nstlist (now 10) should be larger.
The optimum depends on your CPU and GPU resources.
You might want to try several nstlist values.
Changing nstlist from 10 to 40, rlist from 1.2 to 1.285
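
One quick way to follow this advice, without regenerating the .tpr, is
mdrun's -nstlist override. A minimal sketch, assuming short throwaway runs
are acceptable (the step count and log names below are illustrative only):

    # compare the performance reported at the end of each log
    for n in 20 25 40 50; do
        gmx_mpi mdrun -deffnm prod_20ns -nstlist $n -nsteps 5000 -g nstlist_$n.log
    done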

Input Parameters:
    integrator                     = md
    tinit                          = 0
    dt                             = 0.002
    nsteps                         = 10000000
    init-step                      = 0
    simulation-part                = 1
    comm-mode                      = Linear
    nstcomm                        = 1
    bd-fric                        = 0
    ld-seed                        = 1993
    emtol                          = 10
    emstep                         = 0.01
    niter                          = 20
    fcstep                         = 0
    nstcgsteep                     = 1000
    nbfgscorr                      = 10
    rtpi                           = 0.05
    nstxout                        = 2500
    nstvout                        = 2500
    nstfout                        = 0
    nstlog                         = 2500
    nstcalcenergy                  = 1
    nstenergy                      = 2500
    nstxout-compressed             = 500
    compressed-x-precision         = 1000
    cutoff-scheme                  = Verlet
    nstlist                        = 40
    ns-type                        = Grid
    pbc                            = xyz
    periodic-molecules             = FALSE
    verlet-buffer-tolerance        = 0.005
    rlist                          = 1.285
    rlistlong                      = 1.285
    nstcalclr                      = 10
    coulombtype                    = PME
    coulomb-modifier               = Potential-shift
    rcoulomb-switch                = 0
    rcoulomb                       = 1.2
    epsilon-r                      = 1
    epsilon-rf                     = 1
    vdw-type                       = Cut-off
    vdw-modifier                   = Potential-shift
    rvdw-switch                    = 0
    rvdw                           = 1.2
    DispCorr                       = No
    table-extension                = 1
    fourierspacing                 = 0.135
    fourier-nx                     = 128
    fourier-ny                     = 128
    fourier-nz                     = 128
    pme-order                      = 4
    ewald-rtol                     = 1e-05
    ewald-rtol-lj                  = 0.001
    lj-pme-comb-rule               = Geometric
    ewald-geometry                 = 0
    epsilon-surface                = 0
    implicit-solvent               = No
    gb-algorithm                   = Still
    nstgbradii                     = 1
    rgbradii                       = 2
    gb-epsilon-solvent             = 80
    gb-saltconc                    = 0
    gb-obc-alpha                   = 1
    gb-obc-beta                    = 0.8
    gb-obc-gamma                   = 4.85
    gb-dielectric-offset           = 0.009
    sa-algorithm                   = Ace-approximation
    sa-surface-tension             = 2.092
    tcoupl                         = V-rescale
    nsttcouple                     = 10
    nh-chain-length                = 0
    print-nose-hoover-chain-variables = FALSE
    pcoupl                         = No
    pcoupltype                     = Semiisotropic
    nstpcouple                     = -1
    tau-p                          = 0.5
    compressibility (3x3):
       compressibility[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
       compressibility[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
       compressibility[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
    ref-p (3x3):
       ref-p[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
       ref-p[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
       ref-p[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
    refcoord-scaling               = No
    posres-com (3):
       posres-com[0]= 0.00000e+00
       posres-com[1]= 0.00000e+00
       posres-com[2]= 0.00000e+00
    posres-comB (3):
       posres-comB[0]= 0.00000e+00
       posres-comB[1]= 0.00000e+00
       posres-comB[2]= 0.00000e+00
    QMMM                           = FALSE
    QMconstraints                  = 0
    QMMMscheme                     = 0
    MMChargeScaleFactor            = 1
qm-opts:
    ngQM                           = 0
    constraint-algorithm           = Lincs
    continuation                   = FALSE
    Shake-SOR                      = FALSE
    shake-tol                      = 0.0001
    lincs-order                    = 4
    lincs-iter                     = 1
    lincs-warnangle                = 30
    nwall                          = 0
    wall-type                      = 9-3
    wall-r-linpot                  = -1
    wall-atomtype[0]               = -1
    wall-atomtype[1]               = -1
    wall-density[0]                = 0
    wall-density[1]                = 0
    wall-ewald-zfac                = 3
    pull                           = no
    rotation                       = FALSE
    interactiveMD                  = FALSE
    disre                          = No
    disre-weighting                = Conservative
    disre-mixed                    = FALSE
    dr-fc                          = 1000
    dr-tau                         = 0
    nstdisreout                    = 100
    orire-fc                       = 0
    orire-tau                      = 0
    nstorireout                    = 100
    free-energy                    = no
    cos-acceleration               = 0
    deform (3x3):
       deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
       deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
       deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
    simulated-tempering            = FALSE
    E-x:
       n = 0
    E-xt:
       n = 0
    E-y:
       n = 0
    E-yt:
       n = 0
    E-z:
       n = 0
    E-zt:
       n = 0
    swapcoords                     = no
    adress                         = FALSE
    userint1                       = 0
    userint2                       = 0
    userint3                       = 0
    userint4                       = 0
    userreal1                      = 0
    userreal2                      = 0
    userreal3                      = 0
    userreal4                      = 0
grpopts:
    nrdf:      869226
    ref-t:         300
    tau-t:         0.1
annealing:          No
annealing-npoints:           0
    acc:	           0           0           0
    nfreeze:           N           N           N
    energygrp-flags[  0]: 0
Using 1 MPI process
Using 32 OpenMP threads

Detecting CPU SIMD instructions.
Present hardware specification:
Vendor: GenuineIntel
Brand:  Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Family:  6  Model: 62  Stepping:  4
Features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
SIMD instructions most likely to fit this hardware: AVX_256
SIMD instructions selected at GROMACS compile time: AVX_256


2 GPUs detected on host localhost.localdomain:
   #0: NVIDIA Tesla K20c, compute cap.: 3.5, ECC: yes, stat: compatible
   #1: NVIDIA GeForce GTX 650, compute cap.: 3.0, ECC:  no, stat: compatible

1 GPU auto-selected for this run.
Mapping of GPU to the 1 PP rank in this node: #0


NOTE: potentially sub-optimal launch configuration, gmx_mpi started with less
       PP MPI process per node than GPUs available.
       Each PP MPI process can use only one GPU, 1 GPU per node will be used.
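
A launch along the lines of this note would start one PP rank per GPU (a
sketch only, not a tested command line for this machine):

    # rank 0 -> GPU #0 (K20c), rank 1 -> GPU #1 (GTX 650)
    mpirun -np 2 gmx_mpi mdrun -deffnm prod_20ns -ntomp 16 -gpu_id 01

Whether that helps here is doubtful, since the GTX 650 is far slower than
the K20c and would limit the load balancing. Staying on the K20c alone but
lowering the thread count (e.g. -ntomp 16 instead of all 32 hardware
threads) may already be worth timing, as very wide single-rank OpenMP runs
tend to scale poorly across two sockets.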

Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Cut-off's:   NS: 1.285   Coulomb: 1.2   LJ: 1.2
System total charge: -0.012
Generated table with 1142 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1142 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1142 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 1142 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1142 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1142 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using CUDA 8x8 non-bonded kernels

Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 7.82e-04 size: 1536

Removing pbc first time
Pinning threads with an auto-selected logical core stride of 1

Initializing LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- --- Thank You --- -------- --------

The number of constraints is 5913

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------

Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
   0:  rest

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
G. Bussi, D. Donadio and M. Parrinello
Canonical sampling through velocity rescaling
J. Chem. Phys. 126 (2007) pp. 014101
-------- -------- --- Thank You --- -------- --------

There are: 434658 Atoms

Constraining the starting coordinates (step 0)

Constraining the coordinates at t0-dt (step 0)
RMS relative constraint deviation after constraining: 3.67e-05
Initial temperature: 300.5 K

Started mdrun on rank 0 Mon Dec 22 16:28:01 2014
            Step           Time         Lambda
               0        0.00000        0.00000

    Energies (kJ/mol)
        G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
     9.74139e+03    4.34956e+03    2.97359e+03   -1.93107e+02    8.05534e+04
         LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
     1.01340e+06   -7.13271e+06    2.01361e+04   -6.00175e+06    1.09887e+06
    Total Energy  Conserved En.    Temperature Pressure (bar)   Constr. rmsd
    -4.90288e+06   -4.90288e+06    3.04092e+02    1.70897e+02    2.16683e-05

step   80: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6279.0 M-cycles
step  160: timed with pme grid 112 112 112, coulomb cutoff 1.306: 6962.2 M-cycles
step  240: timed with pme grid 100 100 100, coulomb cutoff 1.463: 8406.5 M-cycles
step  320: timed with pme grid 128 128 128, coulomb cutoff 1.200: 6424.0 M-cycles
step  400: timed with pme grid 120 120 120, coulomb cutoff 1.219: 6369.1 M-cycles
step  480: timed with pme grid 112 112 112, coulomb cutoff 1.306: 7309.0 M-cycles
step  560: timed with pme grid 108 108 108, coulomb cutoff 1.355: 7521.2 M-cycles
step  640: timed with pme grid 104 104 104, coulomb cutoff 1.407: 8369.8 M-cycles
               optimal pme grid 128 128 128, coulomb cutoff 1.200
            Step           Time         Lambda
            2500        5.00000        0.00000

    Energies (kJ/mol)
        G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
     9.72545e+03    4.33046e+03    2.98087e+03   -1.95794e+02    8.05967e+04
         LJ (SR)   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
     1.01293e+06   -7.13110e+06    2.01689e+04   -6.00057e+06    1.08489e+06
    Total Energy  Conserved En.    Temperature Pressure (bar)   Constr. rmsd
    -4.91567e+06   -4.90300e+06    3.00225e+02    1.36173e+02    2.25998e-05

            Step           Time         Lambda
            5000       10.00000        0.00000

............

-------------------------------------------------------------------------------

Thank you in advance

-- 
Carmen Di Giovanni, PhD
Dept. of Pharmaceutical and Toxicological Chemistry
"Drug Discovery Lab"
University of Naples "Federico II"
Via D. Montesano, 49
80131 Naples
Tel.: ++39 081 678623
Fax: ++39 081 678100
Email: cdigiova at unina.it



Quoting Justin Lemkul <jalemkul at vt.edu>:

>
>
> On 2/18/15 11:09 AM, Barnett, James W wrote:
>> What's your exact command?
>>
>
> A full .log file would be even better; it would tell us everything  
> we need to know :)
>
> -Justin
>
>> Have you reviewed this page:  
>> http://www.gromacs.org/Documentation/Acceleration_and_parallelization
>>
>> James "Wes" Barnett
>> Ph.D. Candidate
>> Chemical and Biomolecular Engineering
>>
>> Tulane University
>> Boggs Center for Energy and Biotechnology, Room 341-B
>>
>> ________________________________________
>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se  
>> <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of  
>> Carmen Di Giovanni <cdigiova at unina.it>
>> Sent: Wednesday, February 18, 2015 10:06 AM
>> To: gromacs.org_gmx-users at maillist.sys.kth.se
>> Subject: Re: [gmx-users] GPU low performance
>>
>> I post the message of a md run :
>>
>>
>> Force evaluation time GPU/CPU: 40.974 ms/24.437 ms = 1.677
>> For optimal performance this ratio should be close to 1!
>>
>>
>> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>>        performance loss, consider using a shorter cut-off and a finer PME grid.
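
A sketch of what this NOTE suggests, expressed as .mdp changes (the values
below are illustrative, and whether the cut-offs may be shortened at all
depends on the force field):

    ; shift work from the GPU (short-range) to the CPU (PME)
    rcoulomb        = 1.0    ; shorter real-space cut-off than the 1.2 used here
    rvdw            = 1.0
    fourierspacing  = 0.12   ; finer PME grid than the 0.135 used here

Note that mdrun's automatic PME tuning (the "timed with pme grid" lines in
the log above) only moves load the other way, by growing the Coulomb
cut-off and coarsening the grid; it never shrinks the cut-off below the
.mdp value, so rebalancing toward the CPU has to come from the .mdp.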
>>
>> How can I solve this problem?
>> Thank you in advance
>>
>>
>> --
>> Carmen Di Giovanni, PhD
>> Dept. of Pharmaceutical and Toxicological Chemistry
>> "Drug Discovery Lab"
>> University of Naples "Federico II"
>> Via D. Montesano, 49
>> 80131 Naples
>> Tel.: ++39 081 678623
>> Fax: ++39 081 678100
>> Email: cdigiova at unina.it
>>
>>
>>
>> Quoting Justin Lemkul <jalemkul at vt.edu>:
>>
>>>
>>>
>>> On 2/18/15 10:30 AM, Carmen Di Giovanni wrote:
>>>> Dear all,
>>>> I'm working on a machine with an NVIDIA Tesla K20.
>>>> After a minimization on a protein of 1925 atoms, this is the message:
>>>>
>>>> Force evaluation time GPU/CPU: 2.923 ms/116.774 ms = 0.025
>>>> For optimal performance this ratio should be close to 1!
>>>>
>>>
>>> Minimization is a poor indicator of performance.  Do a real MD run.
>>>
>>>>
>>>> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>>>> performance loss.
>>>>
>>>>                Core t (s)   Wall t (s)        (%)
>>>>        Time:     3289.010      205.891     1597.4
>>>>                  (steps/hour)
>>>> Performance:       8480.2
>>>> Finished mdrun on rank 0 Wed Feb 18 15:50:06 2015
>>>>
>>>>
>>>> Can I improve the performance?
>>>> So far I haven't found enough information in the forum to solve
>>>> this problem.
>>>> The .log file is attached.
>>>>
>>>
>>> The list does not accept attachments.  If you wish to share a file,
>>> upload it to a file-sharing service and provide a URL.  The full
>>> .log is quite important for understanding your hardware,
>>> optimizations, and seeing full details of the performance breakdown.
>>>  But again, base your assessment on MD, not EM.
>>>
>>> -Justin
>>>
>>> --
>>> ==================================================
>>>
>>> Justin A. Lemkul, Ph.D.
>>> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>>>
>>> Department of Pharmaceutical Sciences
>>> School of Pharmacy
>>> Health Sciences Facility II, Room 629
>>> University of Maryland, Baltimore
>>> 20 Penn St.
>>> Baltimore, MD 21201
>>>
>>> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
>>> http://mackerell.umaryland.edu/~jalemkul
>>>
>>> ==================================================
>
> -- 
> ==================================================
>
> Justin A. Lemkul, Ph.D.
> Ruth L. Kirschstein NRSA Postdoctoral Fellow
>
> Department of Pharmaceutical Sciences
> School of Pharmacy
> Health Sciences Facility II, Room 629
> University of Maryland, Baltimore
> 20 Penn St.
> Baltimore, MD 21201
>
> jalemkul at outerbanks.umaryland.edu | (410) 706-7441
> http://mackerell.umaryland.edu/~jalemkul
>
> ==================================================