[gmx-users] GROMACS performance with an NVidia Tesla k40c

Michail Palaiokostas Avramidis m.palaiokostas at qmul.ac.uk
Wed Jan 20 18:35:57 CET 2016


Hi,

thanks for your answer. :) I tried to attach the log file to my previous email,
but for some reason it never arrived on the list. I also looked into this a bit
more: the speed-up already reaches the theoretical maximum reported by NVIDIA,
so probably I shouldn't be greedy. The log file, stripped of the citation notes
and most of the thermodynamic output, is:

Log file opened on Tue Jan 19 18:48:55 2016
Host: -  pid: 1215  rank ID: 0  number of ranks:  1

GROMACS:      gmx mdrun, VERSION 5.1.1
Executable:   /usr/local/gromacs/bin/gmx
Data prefix:  /usr/local/gromacs
Command line:
  gmx mdrun -v -deffnm npt-ini -gpu_id 0

GROMACS version:    VERSION 5.1.1
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:        enabled
OpenCL support:     disabled
invsqrt routine:    gmx_software_invsqrt(x)
SIMD instructions:  AVX_256
FFT library:        fftw-3.3.4-sse2-avx
RDTSCP usage:       enabled
C++11 compilation:  disabled
TNG support:        enabled
Tracing support:    disabled
Built on:           Tue Jan 19 18:00:50 GMT 2016
Built by:           - [CMAKE]
Build OS/arch:      Linux 3.13.0-74-generic x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
Build CPU family:   6   Model: 45   Stepping: 7
Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /usr/bin/cc GNU 4.8.4
C compiler flags:    -mavx    -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds 
C++ compiler:       /usr/bin/c++ GNU 4.8.4
C++ compiler flags:  -mavx    -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds 
Boost version:      1.55.0 (internal)
CUDA compiler:      /usr/local/cuda-7.5/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2015 NVIDIA Corporation;Built on Tue_Aug_11_14:27:32_CDT_2015;Cuda compilation tools, release 7.5, V7.5.17
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_52,code=compute_52;-use_fast_math;; ;-mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver:        7.50
CUDA runtime:       7.50


Running on 1 node with total 4 cores, 8 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
    Family:  6  model: 45  stepping:  7
    CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    SIMD instructions most likely to fit this hardware: AVX_256
    SIMD instructions selected at GROMACS compile time: AVX_256
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA Tesla K40c, compute cap.: 3.5, ECC: yes, stat: compatible
    #1: NVIDIA Quadro 2000, compute cap.: 2.1, ECC:  no, stat: compatible

For optimal performance with a GPU nstlist (now 10) should be larger.
The optimum depends on your CPU and GPU resources.
You might want to try several nstlist values.
Changing nstlist from 10 to 40, rlist from 1.2 to 1.235

Input Parameters:
   integrator                     = md
   tinit                          = 0
   dt                             = 0.002
   nsteps                         = 5000
   init-step                      = 0
   simulation-part                = 1
   comm-mode                      = Linear
   nstcomm                        = 100
   bd-fric                        = 0
   ld-seed                        = 555955623
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 500
   nstvout                        = 500
   nstfout                        = 500
   nstlog                         = 500
   nstcalcenergy                  = 100
   nstenergy                      = 500
   nstxout-compressed             = 500
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 40
   ns-type                        = Grid
   pbc                            = xyz
   periodic-molecules             = FALSE
   verlet-buffer-tolerance        = 0.005
   rlist                          = 1.235
   rlistlong                      = 1.235
   nstcalclr                      = 10
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1.2
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Force-switch
   rvdw-switch                    = 1
   rvdw                           = 1.2
   DispCorr                       = No
   table-extension                = 1
   fourierspacing                 = 0.16
   fourier-nx                     = 44
   fourier-ny                     = 42
   fourier-nz                     = 42
   pme-order                      = 4
   ewald-rtol                     = 1e-05
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   implicit-solvent               = No
   gb-algorithm                   = Still
   nstgbradii                     = 1
   rgbradii                       = 1
   gb-epsilon-solvent             = 80
   gb-saltconc                    = 0
   gb-obc-alpha                   = 1
   gb-obc-beta                    = 0.8
   gb-obc-gamma                   = 4.85
   gb-dielectric-offset           = 0.009
   sa-algorithm                   = Ace-approximation
   sa-surface-tension             = 2.05016
   tcoupl                         = V-rescale
   nsttcouple                     = 10
   nh-chain-length                = 0
   print-nose-hoover-chain-variables = FALSE
   pcoupl                         = Parrinello-Rahman
   pcoupltype                     = Semiisotropic
   nstpcouple                     = 10
   tau-p                          = 2
   compressibility (3x3):
      compressibility[    0]={ 4.50000e-05,  0.00000e+00,  0.00000e+00}
      compressibility[    1]={ 0.00000e+00,  4.50000e-05,  0.00000e+00}
      compressibility[    2]={ 0.00000e+00,  0.00000e+00,  4.50000e-05}
   ref-p (3x3):
      ref-p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
      ref-p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
   refcoord-scaling               = No
   posres-com (3):
      posres-com[0]= 0.00000e+00
      posres-com[1]= 0.00000e+00
      posres-com[2]= 0.00000e+00
   posres-comB (3):
      posres-comB[0]= 0.00000e+00
      posres-comB[1]= 0.00000e+00
      posres-comB[2]= 0.00000e+00
   QMMM                           = FALSE
   QMconstraints                  = 0
   QMMMscheme                     = 0
   MMChargeScaleFactor            = 1
qm-opts:
   ngQM                           = 0
   constraint-algorithm           = Lincs
   continuation                   = FALSE
   Shake-SOR                      = FALSE
   shake-tol                      = 0.0001
   lincs-order                    = 4
   lincs-iter                     = 2
   lincs-warnangle                = 30
   nwall                          = 0
   wall-type                      = 9-3
   wall-r-linpot                  = -1
   wall-atomtype[0]               = -1
   wall-atomtype[1]               = -1
   wall-density[0]                = 0
   wall-density[1]                = 0
   wall-ewald-zfac                = 3
   pull                           = TRUE
   pull-cylinder-r                = 1.5
   pull-constr-tol                = 1e-06
   pull-print-COM1                = TRUE
   pull-print-COM2                = TRUE
   pull-print-ref-value           = FALSE
   pull-print-components          = FALSE
   pull-nstxout                   = 500
   pull-nstfout                   = 500
   pull-ngroups                   = 3
   pull-group 0:
     atom: not available
     weight: not available
     pbcatom                        = -1
   pull-group 1:
     atom (3):
        atom[0,...,2] = {30564,...,30566}
     weight: not available
     pbcatom                        = 30565
   pull-group 2:
     atom (17664):
        atom[0,...,17663] = {0,...,17663}
     weight: not available
     pbcatom                        = 8831
   pull-ncoords                   = 1
   pull-coord 0:
   group[0]                       = 1
   group[1]                       = 2
   type                           = constraint
   geometry                       = distance
   dim (3):
      dim[0]=0
      dim[1]=0
      dim[2]=1
   origin (3):
      origin[0]= 0.00000e+00
      origin[1]= 0.00000e+00
      origin[2]= 0.00000e+00
   vec (3):
      vec[0]= 0.00000e+00
      vec[1]= 0.00000e+00
      vec[2]= 0.00000e+00
   start                          = TRUE
   init                           = 0.000595327
   rate                           = 0
   k                              = 0
   kB                             = 0
   rotation                       = FALSE
   interactiveMD                  = FALSE
   disre                          = No
   disre-weighting                = Conservative
   disre-mixed                    = FALSE
   dr-fc                          = 1000
   dr-tau                         = 0
   nstdisreout                    = 100
   orire-fc                       = 0
   orire-tau                      = 0
   nstorireout                    = 100
   free-energy                    = no
   cos-acceleration               = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   simulated-tempering            = FALSE
   E-x:
      n = 0
   E-xt:
      n = 0
   E-y:
      n = 0
   E-yt:
      n = 0
   E-z:
      n = 0
   E-zt:
      n = 0
   swapcoords                     = no
   adress                         = FALSE
   userint1                       = 0
   userint2                       = 0
   userint3                       = 0
   userint4                       = 0
   userreal1                      = 0
   userreal2                      = 0
   userreal3                      = 0
   userreal4                      = 0
grpopts:
   nrdf:     25804.4     42237.6
   ref-t:         300         300
   tau-t:         0.1         0.1
annealing:          No          No
annealing-npoints:           0           0
   acc:	           0           0           0
   nfreeze:           N           N           N
   energygrp-flags[  0]: 0

Using 1 MPI thread
Using 8 OpenMP threads 

1 GPU user-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0

Will do PME sum in reciprocal space for electrostatic interactions.

Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Cut-off's:   NS: 1.235   Coulomb: 1.2   LJ: 1.2
System total charge: 0.000
Generated table with 1117 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1117 data points for LJ6Shift.
Tabscale = 500 points/nm
Generated table with 1117 data points for LJ12Shift.
Tabscale = 500 points/nm
Generated table with 1117 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1117 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1117 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Potential shift: LJ r^-12: -2.648e-01 r^-6: -5.349e-01, Ewald -1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 1.02e-03 size: 1176

Application clocks (GPU clocks) for Tesla K40c are (3004,875)

Using GPU 8x8 non-bonded kernels

Removing pbc first time
Pinning threads with an auto-selected logical core stride of 1

Will apply constraint COM pulling
with 1 pull coordinate and 3 groups
Pull group 1:     3 atoms, mass    18.015
Pull group 2: 17664 atoms, mass 100624.920

Initializing LINear Constraint Solver
The number of constraints is 10752

Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  rest
There are: 30567 Atoms

Constraining the starting coordinates (step 0)

Constraining the coordinates at t0-dt (step 0)
RMS relative constraint deviation after constraining: 9.36e-07
Initial temperature: 300.416 K

Started mdrun on rank 0 Tue Jan 19 18:48:56 2016
           Step           Time         Lambda
              0        0.00000        0.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.          LJ-14
    1.02711e+03    1.09345e+04    2.90167e+04    4.83622e+01    3.57667e+03
     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential
   -5.95455e+04   -2.14395e+04   -2.03396e+05    9.92446e+02   -2.38785e+05
    Kinetic En.   Total Energy    Temperature Pressure (bar)   Constr. rmsd
    8.50599e+04   -1.53725e+05    3.00707e+02   -4.25233e+03    9.46472e-07

step   80: timed with pme grid 44 42 42, coulomb cutoff 1.200: 663.9 M-cycles
step  160: timed with pme grid 36 36 36, coulomb cutoff 1.402: 653.5 M-cycles
step  240: timed with pme grid 32 32 32, coulomb cutoff 1.577: 684.3 M-cycles
step  320: timed with pme grid 28 28 28, coulomb cutoff 1.802: 974.3 M-cycles
step  400: timed with pme grid 32 32 28, coulomb cutoff 1.776: 938.4 M-cycles
step  480: timed with pme grid 32 32 32, coulomb cutoff 1.577: 689.4 M-cycles
           Step           Time         Lambda
            500        1.00000        0.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.          LJ-14
    7.51593e+03    4.07377e+04    3.25875e+04    2.95374e+02    5.02562e+03
     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential
   -5.98575e+04   -9.86299e+03   -1.90656e+05    5.40848e+02   -1.73673e+05
    Kinetic En.   Total Energy    Temperature Pressure (bar)   Constr. rmsd
    8.42730e+04   -8.94003e+04    2.97925e+02   -2.14725e+01    9.49193e-07

step  560: timed with pme grid 36 36 32, coulomb cutoff 1.554: 746.7 M-cycles
step  640: timed with pme grid 36 36 36, coulomb cutoff 1.402: 661.4 M-cycles
step  720: timed with pme grid 40 40 36, coulomb cutoff 1.381: 640.6 M-cycles
step  800: timed with pme grid 40 40 40, coulomb cutoff 1.261: 635.5 M-cycles
step  880: timed with pme grid 42 42 40, coulomb cutoff 1.243: 639.3 M-cycles
step  960: timed with pme grid 42 42 42, coulomb cutoff 1.201: 649.2 M-cycles
              optimal pme grid 40 40 40, coulomb cutoff 1.261
           Step           Time         Lambda
           1000        2.00000        0.00000


	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 5001 steps using 51 frames

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.          LJ-14
    7.42902e+03    4.10842e+04    3.21342e+04    2.92088e+02    4.99130e+03
     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.      Potential
   -5.95565e+04   -1.06737e+04   -1.92424e+05    1.02981e+03   -1.75694e+05
    Kinetic En.   Total Energy    Temperature Pressure (bar)   Constr. rmsd
    8.45321e+04   -9.11616e+04    2.98841e+02   -8.24359e+01    0.00000e+00

          Box-X          Box-Y          Box-Z
    6.69363e+00    6.68348e+00    6.57835e+00

   Total Virial (kJ/mol)
    2.87684e+04    3.11713e+02   -2.75630e+00
    3.13663e+02    2.86418e+04   -1.21042e+02
   -1.91647e+00   -1.21727e+02    2.93717e+04

   Pressure (bar)
   -1.09311e+02   -3.46385e+01   -4.90594e+00
   -3.48584e+01   -1.00506e+02    2.54664e+00
   -4.99980e+00    2.62332e+00   -3.74899e+01

        T-Water    T-non-Water
    2.99554e+02    2.98406e+02


       P P   -   P M E   L O A D   B A L A N C I N G

 PP/PME load balancing changed the cut-off and PME settings:
           particle-particle                    PME
            rcoulomb  rlist            grid      spacing   1/beta
   initial  1.200 nm  1.235 nm      44  42  42   0.160 nm  0.384 nm
   final    1.261 nm  1.296 nm      40  40  40   0.168 nm  0.404 nm
 cost-ratio           1.16             0.82
 (note that these numbers concern only part of the total PP and PME load)


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 NB VdW [V&F]                           167.713536         167.714     0.0
 Pair Search distance check             620.252992        5582.277     0.0
 NxN Ewald Elec. + LJ [F]            205928.542336    16062426.302    96.7
 NxN Ewald Elec. + LJ [V&F]            2110.154688      272209.955     1.6
 1,4 nonbonded interactions             232.366464       20912.982     0.1
 Calc Weights                           458.596701       16509.481     0.1
 Spread Q Bspline                      9783.396288       19566.793     0.1
 Gather F Bspline                      9783.396288       58700.378     0.4
 3D-FFT                                9752.042820       78016.343     0.5
 Solve PME                                7.771200         497.357     0.0
 Shift-X                                  3.851442          23.109     0.0
 Bonds                                   33.926784        2001.680     0.0
 Propers                                283.576704       64939.065     0.4
 Impropers                                1.280256         266.293     0.0
 Virial                                  15.336612         276.059     0.0
 Stop-CM                                  1.589484          15.895     0.0
 Calc-Ekin                               30.628134         826.960     0.0
 Lincs                                   53.792256        3227.535     0.0
 Lincs-Mat                              361.176576        1444.706     0.0
 Constraint-V                           172.103814        1376.831     0.0
 Constraint-Vir                          11.851155         284.428     0.0
 Settle                                  21.517903        6950.283     0.0
-----------------------------------------------------------------------------
 Total                                                16616222.423   100.0
-----------------------------------------------------------------------------


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 8 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1    8        126       0.580         15.282   2.1
 Launch GPU ops.        1    8       5001       0.438         11.541   1.6
 Force                  1    8       5001      10.488        276.222  37.4
 PME mesh               1    8       5001       5.459        143.759  19.5
 Wait GPU local         1    8       5001       0.657         17.312   2.3
 NB X/F buffer ops.     1    8       9876       0.352          9.265   1.3
 Write traj.            1    8         11       0.353          9.285   1.3
 Update                 1    8       5001       1.604         42.248   5.7
 Constraints            1    8       5001       4.730        124.571  16.9
 Rest                                           3.397         89.455  12.1
-----------------------------------------------------------------------------
 Total                                         28.057        738.939 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread/gather      1    8      10002       4.435        116.816  15.8
 PME 3D-FFT             1    8      10002       0.833         21.939   3.0
 PME solve Elec         1    8       5001       0.169          4.451   0.6
-----------------------------------------------------------------------------

 GPU timings
-----------------------------------------------------------------------------
 Computing:                         Count  Wall t (s)      ms/step       %
-----------------------------------------------------------------------------
 Pair list H2D                        126       0.029        0.230     0.3
 X / q H2D                           5001       0.260        0.052     2.3
 Nonbonded F kernel                  4850      10.204        2.104    91.4
 Nonbonded F+ene k.                    25       0.089        3.554     0.8
 Nonbonded F+prune k.                 100       0.278        2.778     2.5
 Nonbonded F+ene+prune k.              26       0.103        3.944     0.9
 F D2H                               5001       0.199        0.040     1.8
-----------------------------------------------------------------------------
 Total                                         11.160        2.232   100.0
-----------------------------------------------------------------------------

Force evaluation time GPU/CPU: 2.232 ms/3.189 ms = 0.700
For optimal performance this ratio should be close to 1!


NOTE: The GPU has >25% less load than the CPU. This imbalance causes
      performance loss.

               Core t (s)   Wall t (s)        (%)
       Time:      218.210       28.057      777.7
                 (ns/day)    (hour/ns)
Performance:       30.800        0.779
Finished mdrun on rank 0 Tue Jan 19 18:49:24 2016

-------------------------------------------------------------------
Michail (Michalis) Palaiokostas
PhD Student
School of Engineering and Materials Science
Queen Mary University of London
-------------------------------------------------------------------

________________________________________
From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Szilárd Páll <pall.szilard at gmail.com>
Sent: Tuesday, January 19, 2016 9:05 PM
To: Discussion list for GROMACS users
Subject: Re: [gmx-users] GROMACS performance with an NVidia Tesla k40c

Hi,


On Tue, Jan 19, 2016 at 8:34 PM, Michail Palaiokostas Avramidis <
m.palaiokostas at qmul.ac.uk> wrote:

> Dear GMX users,
>
>
> I have recently installed an Nvidia Tesla K40c in my workstation (already
> had a Quadro K2000) and I am currently trying to optimize its usage with
> GROMACS. I used two compilations of GROMACS: one is the standard one, as
> suggested at the beginning of the installation documentation, and one where
> I added some more flags to see what would happen. The latter compilation
> used:
>
>
> cmake .. -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DGMX_GPU=on
> -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-7.5
> -DNVML_INCLUDE_DIR=/usr/include/nvidia/gdk
> -DNVML_LIBRARY=/usr/lib/nvidia-352/libnvidia-ml.so
>
>
Looks reasonable.


>
> So far I used 4 different combinations to test a water-membrane system of
> ~30500 atoms for 5000 steps:
>
> 1) CPU only,
>
> 2) CPU+2GPUs (the default),
>
> 3) CPU+Quadro and
>
> 4) CPU+Tesla.
>
> Obviously the fastest is the Tesla one with 31ns/day. This is 3.6 times
> faster than the CPU-only setup.
>
>
> While this is good, I am not entirely satisfied with the speed-up. Do you
> think this is normal? Would you expect more?
>

3.6x is perfectly normal; the typical GPU acceleration improvement is 2-4x.

What makes you unsatisfied; why do you expect more speedup? (If you happen
to be comparing to the speedup of other MD packages, do consider that
GROMACS has highly optimized SIMD CPU kernels, which makes it quite fast on
CPUs alone. With an already highly optimized baseline it's harder to get a
high speedup, no matter what kind of accelerator you use.)


> One thing I noticed is that there was absolutely no difference when using
> the custom, GPU-oriented compilation of GROMACS. Did I miss something there?
>

Not sure what you're referring to here, could you clarify?
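
One quick sanity check, assuming both builds install a gmx binary somewhere on
the PATH, is to confirm which executable the runs actually picked up, e.g.

    which gmx
    gmx --version

The "GPU support", "CUDA compiler" and build-date lines near the top of that
output (the same ones shown in the log header above) should match the build you
intended to test.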


> The second thing I noticed is that even when increasing nstlist the
> performance remained the same (despite the suggestion in the documentation).
>

Increasing from what value to what value? Note that mdrun will by default
increase nstlist if the initial value is small.
See Table 2 and related text in http://doi.wiley.com/10.1002/jcc.24030.
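
For reference, a minimal way to try several values (assuming the Verlet cut-off
scheme, as in the log above) is to override nstlist on the mdrun command line
instead of editing the .mdp each time, e.g.

    gmx mdrun -v -deffnm npt-ini -gpu_id 0 -nstlist 25
    gmx mdrun -v -deffnm npt-ini -gpu_id 0 -nstlist 50
    gmx mdrun -v -deffnm npt-ini -gpu_id 0 -nstlist 100

mdrun enlarges rlist to keep the Verlet buffer tolerance, so only the pair-list
update interval changes between these runs.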

> Finally, in my log file I got the message (the actual log is attached to
> this message):
>
> Force evaluation time GPU/CPU: 2.232 ms/3.189 ms = 0.700
>
> For optimal performance this ratio should be close to 1!
>
> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
> performance loss.
>
>
> Can you please help me solve this imbalance? At the moment I am executing
> GROMACS with: gmx mdrun -v -deffnm npt-ini -gpu_id 0
>

The automated CPU-GPU load balancer should address this on its own - if
possible. If your CPU is relatively slow, there is often not much more to
do.
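
If you want to double-check, a sketch of a quick benchmark confined to the K40c
with the cores pinned (all of these mdrun options exist in 5.1; the exact values
are only an example) would be:

    gmx mdrun -deffnm npt-ini -gpu_id 0 -ntmpi 1 -ntomp 8 -pin on -resethway -nsteps 10000

-resethway resets the cycle counters halfway through the run, so the reported
ns/day is not skewed by the PP-PME tuning phase at the start.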

Post log files of your runs and we may be able to suggest more.

Cheers,
--
Szilárd

> Thank you in advance for your help.
>
>
> Best Regards,
>
> Michail
>
>
> -------------------------------------------------------------------
> Michail (Michalis) Palaiokostas
> PhD Student
> School of Engineering and Materials Science
> Queen Mary University of London
> -------------------------------------------------------------------
>
>