[gmx-users] How to redirect the calculation load toward GPU

Tue Apr 8 17:19:00 CEST 2014

Below is a chunk of the log file:

Log file opened on Tue Apr  8 09:57:06 2014
Host: Obsidian03  pid: 10221  nodeid: 0  nnodes:  1
Gromacs version:    VERSION 4.6.5
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled
GPU support:        enabled
invsqrt routine:    gmx_software_invsqrt(x)
CPU acceleration:   SSE2
FFT library:        fftw-3.3.3-sse2
Large file support: enabled
RDTSCP usage:       enabled
Built on:           Thu Feb 13 17:01:44 CET 2014
Built by:           portage at Chlorine01 [CMAKE]
Build OS/arch:      Linux 3.10.17-gentoo-Generic-x64 x86_64
Build CPU vendor:   AuthenticAMD
Build CPU brand:    AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
Build CPU family:   15   Model: 107   Stepping: 2
Build CPU features: apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr pse rdtscp sse2 sse3
C compiler:         /usr/bin/x86_64-pc-linux-gnu-gcc GNU x86_64-pc-linux-gnu-gcc (Gentoo 4.7.3-r1 p1.4, pie-0.5.5) 4.7.3
C compiler flags:   -msse2    -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wall -Wno-unused -Wunused-value  -march=native -O2 -pipe -fomit-frame-pointer
C++ compiler:       /usr/bin/x86_64-pc-linux-gnu-g++ GNU x86_64-pc-linux-gnu-g++ (Gentoo 4.7.3-r1 p1.4, pie-0.5.5) 4.7.3
C++ compiler flags: -msse2   -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wall -Wno-unused -Wunused-value  -march=native -O2 -pipe -fomit-frame-pointer
CUDA compiler:      /opt/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on Wed_Jul_17_18:36:13_PDT_2013;Cuda compilation tools, release 5.5, V5.5.0
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;-Xcompiler;-fPIC ; -msse2;-Wextra;-Wno-missing-field-initializers;-Wno-sign-compare;-Wall;-Wno-unused;-Wunused-value;-march=native;-O2;-pipe;-fomit-frame-pointer;
CUDA driver:        6.0
CUDA runtime:       5.50

                          :-)  G  R  O  M  A  C  S  (-:

            Glycine aRginine prOline Methionine Alanine Cystine Serine

                             :-)  VERSION 4.6.5  (-:

         Contributions from Mark Abraham, Emile Apol, Rossen Apostolov,
            Herman J.C. Berendsen, Aldert van Buuren, Pär Bjelkmar,
      Rudi van Drunen, Anton Feenstra, Gerrit Groenhof, Christoph Junghans,
         Peter Kasson, Carsten Kutzner, Per Larsson, Pieter Meulenhoff,
            Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
                 Michael Shirts, Alfons Sijbers, Peter Tieleman,

                Berk Hess, David van der Spoel, and Erik Lindahl.

        Copyright (c) 1991-2000, University of Groningen, The Netherlands.
          Copyright (c) 2001-2012,2013, The GROMACS development team at
         Uppsala University & The Royal Institute of Technology, Sweden.
             check out http://www.gromacs.org for more information.

          This program is free software; you can redistribute it and/or
        modify it under the terms of the GNU Lesser General Public License
         as published by the Free Software Foundation; either version 2.1
              of the License, or (at your option) any later version.

                                 :-)  mdrun  (-:

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------

For optimal performance with a GPU nstlist (now 5) should be larger.
The optimum depends on your CPU and GPU resources.
You might want to try several nstlist values.
Changing nstlist from 5 to 40, rlist from 0.9 to 0.999

Input Parameters:
    integrator           = md
    nsteps               = 500000
    init-step            = 0
    cutoff-scheme        = Verlet
    ns_type              = Grid
    nstlist              = 40
    ndelta               = 2
    nstcomm              = 100
    comm-mode            = Linear
    nstlog               = 500
    nstxout              = 10000
    nstvout              = 10000
    nstfout              = 10000
    nstcalcenergy        = 100
    nstenergy            = 2000
    nstxtcout            = 500
    init-t               = 0
    delta-t              = 0.002
    xtcprec              = 1000
    fourierspacing       = 0.12
    nkx                  = 64
    nky                  = 72
    nkz                  = 56
    pme-order            = 4
    ewald-rtol           = 1e-05
    ewald-geometry       = 0
    epsilon-surface      = 0
    optimize-fft         = TRUE
    ePBC                 = xyz
    bPeriodicMols        = FALSE
    bContinuation        = TRUE
    bShakeSOR            = FALSE
    etc                  = Berendsen
    bPrintNHChains       = FALSE
    nsttcouple           = 5
    epc                  = Berendsen
    epctype              = Isotropic
    nstpcouple           = 5
    tau-p                = 1
    ref-p (3x3):
       ref-p[    0]={ 1.00000e+00,  0.00000e+00, 0.00000e+00}
       ref-p[    1]={ 0.00000e+00,  1.00000e+00, 0.00000e+00}
       ref-p[    2]={ 0.00000e+00,  0.00000e+00, 1.00000e+00}
    compress (3x3):
       compress[    0]={ 4.50000e-05,  0.00000e+00, 0.00000e+00}
       compress[    1]={ 0.00000e+00,  4.50000e-05, 0.00000e+00}
       compress[    2]={ 0.00000e+00,  0.00000e+00, 4.50000e-05}
    refcoord-scaling     = No
    posres-com (3):
       posres-com[0]= 0.00000e+00
       posres-com[1]= 0.00000e+00
       posres-com[2]= 0.00000e+00
    posres-comB (3):
       posres-comB[0]= 0.00000e+00
       posres-comB[1]= 0.00000e+00
       posres-comB[2]= 0.00000e+00
    verlet-buffer-drift  = 0.005
    rlist                = 0.999
    rlistlong            = 0.999
    nstcalclr            = 5
    rtpi                 = 0.05
    coulombtype          = PME
    coulomb-modifier     = Potential-shift
    rcoulomb-switch      = 0
    rcoulomb             = 0.9
    vdwtype              = Cut-off
    vdw-modifier         = Potential-shift
    rvdw-switch          = 0
    rvdw                 = 0.9
    epsilon-r            = 1
    epsilon-rf           = inf
    tabext               = 1
    implicit-solvent     = No
    gb-algorithm         = Still
    gb-epsilon-solvent   = 80
    nstgbradii           = 1
    rgbradii             = 1
    gb-saltconc          = 0
    gb-obc-alpha         = 1
    gb-obc-beta          = 0.8
    gb-obc-gamma         = 4.85
    gb-dielectric-offset = 0.009
    sa-algorithm         = Ace-approximation
    sa-surface-tension   = 2.05016
    DispCorr             = No
    bSimTemp             = FALSE
    free-energy          = no
    nwall                = 0
    wall-type            = 9-3
    wall-atomtype[0]     = -1
    wall-atomtype[1]     = -1
    wall-density[0]      = 0
    wall-density[1]      = 0
    wall-ewald-zfac      = 3
    pull                 = no
    rotation             = FALSE
    disre                = No
    disre-weighting      = Conservative
    disre-mixed          = FALSE
    dr-fc                = 1000
    dr-tau               = 0
    nstdisreout          = 100
    orires-fc            = 0
    orires-tau           = 0
    nstorireout          = 100
    dihre-fc             = 0
    em-stepsize          = 0.01
    em-tol               = 10
    niter                = 20
    fc-stepsize          = 0
    nstcgsteep           = 1000
    nbfgscorr            = 10
    ConstAlg             = Lincs
    shake-tol            = 0.0001
    lincs-order          = 4
    lincs-warnangle      = 30
    lincs-iter           = 1
    bd-fric              = 0
    ld-seed              = 1993
    cos-accel            = 0
    deform (3x3):
       deform[    0]={ 0.00000e+00,  0.00000e+00, 0.00000e+00}
       deform[    1]={ 0.00000e+00,  0.00000e+00, 0.00000e+00}
       deform[    2]={ 0.00000e+00,  0.00000e+00, 0.00000e+00}
    adress               = FALSE
    userint1             = 0
    userint2             = 0
    userint3             = 0
    userint4             = 0
    userreal1            = 0
    userreal2            = 0
    userreal3            = 0
    userreal4            = 0
grpopts:
    nrdf:     7097.73     71223.3     41.9984     11.9995
    ref-t:         300         300         300         300
    tau-t:         0.1         0.1         0.1         0.1
anneal:          No          No          No          No
ann-npoints:           0           0           0           0
    acc:               0           0           0
    nfreeze:           N           N           N
    energygrp-flags[  0]: 0 0 0 0
    energygrp-flags[  1]: 0 0 0 0
    energygrp-flags[  2]: 0 0 0 0
    energygrp-flags[  3]: 0 0 0 0
    efield-x:
       n = 0
    efield-xt:
       n = 0
    efield-y:
       n = 0
    efield-yt:
       n = 0
    efield-z:
       n = 0
    efield-zt:
       n = 0
    bQMMM                = FALSE
    QMconstraints        = 0
    QMMMscheme           = 0
    scalefactor          = 1
qm-opts:
    ngQM                 = 0
Using 1 MPI thread
Using 1 OpenMP thread

Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: AuthenticAMD
Brand:  AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
Family: 15  Model: 107  Stepping:  2
Features: apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr pse rdtscp sse2 sse3
Acceleration most likely to fit this hardware: SSE2
Acceleration selected at GROMACS compile time: SSE2

1 GPU detected:
   #0: NVIDIA GeForce GTX 480, compute cap.: 2.0, ECC: no, stat: compatible

1 GPU auto-selected for this run.
Mapping of GPU to the 1 PP rank in this node: #0

Will do PME sum in reciprocal space.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. 
Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
Cut-off's:   NS: 0.999   Coulomb: 0.9   LJ: 0.9
System total charge: -0.000
Generated table with 999 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 999 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 999 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 999 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 999 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 999 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using CUDA 8x8 non-bonded kernels

NOTE: With GPUs, reporting energy group contributions is not supported

Potential shift: LJ r^-12: 3.541 r^-6 1.882, Ewald 1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 5.87e-04 size: 1536

Initializing LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and H. Bekker and H. J. C. Berendsen and J. G. E. M. Fraaije
LINCS: A Linear Constraint Solver for molecular simulations
J. Comp. Chem. 18 (1997) pp. 1463-1472
-------- -------- --- Thank You --- -------- --------

The number of constraints is 3636

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------

Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
   0:  rest

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, J. P. M. Postma, A. DiNola and J. R. Haak
Molecular dynamics with coupling to an external bath
J. Chem. Phys. 81 (1984) pp. 3684-3690
-------- -------- --- Thank You --- -------- --------

There are: 39209 Atoms
There are: 4 VSites
Initial temperature: 299.37 K

Started mdrun on node 0 Tue Apr  8 09:57:08 2014

            Step           Time         Lambda
               0        0.00000        0.00000

    Energies (kJ/mol)
           Angle    Proper Dih. Ryckaert-Bell. LJ-14     Coulomb-14
     7.33721e+03    4.51248e+02    3.79061e+03 5.27540e+03    1.62273e+04
         LJ (SR)   Coulomb (SR)   Coul. recip. Potential    Kinetic En.
     8.73356e+04   -6.50993e+05    5.62839e+03 -5.24947e+05    9.80531e+04
    Total Energy    Temperature Pressure (bar)   Constr. rmsd
    -4.26894e+05    3.00938e+02    4.75944e+02 1.62529e-04

step   80: timed with pme grid 64 72 56, coulomb cutoff 0.900: 7449.1 M-cycles
step  160: timed with pme grid 60 64 52, coulomb cutoff 0.969: 6643.5 M-cycles
step  240: timed with pme grid 52 56 48, coulomb cutoff 1.066: 5848.3 M-cycles
step  320: timed with pme grid 48 52 44, coulomb cutoff 1.154: 5451.7 M-cycles
step  400: timed with pme grid 44 48 40, coulomb cutoff 1.259: 5155.4 M-cycles
step  480: timed with pme grid 40 44 36, coulomb cutoff 1.399: 5345.2 M-cycles
            Step           Time         Lambda
             500        1.00000        0.00000

    Energies (kJ/mol)
           Angle    Proper Dih. Ryckaert-Bell. LJ-14     Coulomb-14
     7.42010e+03    4.62095e+02    3.80227e+03 5.25222e+03    1.62479e+04
         LJ (SR)   Coulomb (SR)   Coul. recip. Potential    Kinetic En.
     8.70828e+04   -6.46020e+05    8.93429e+02 -5.24859e+05    9.78962e+04
    Total Energy    Temperature Pressure (bar)   Constr. rmsd
    -4.26963e+05    3.00457e+02   -4.08038e+01 1.79131e-04

step  560: timed with pme grid 36 40 32, coulomb cutoff 1.574: 5623.4 M-cycles
step  640: timed with pme grid 32 36 28, coulomb cutoff 1.799: 6059.6 M-cycles
step  720: timed with pme grid 52 52 44, coulomb cutoff 1.146: 5551.0 M-cycles
step  800: timed with pme grid 48 52 44, coulomb cutoff 1.154: 5444.4 M-cycles
step  880: timed with pme grid 48 52 42, coulomb cutoff 1.199: 5416.8 M-cycles
step  960: timed with pme grid 48 48 42, coulomb cutoff 1.241: 5219.8 M-cycles
            Step           Time         Lambda
            1000        2.00000        0.00000

    Energies (kJ/mol)
           Angle    Proper Dih. Ryckaert-Bell. LJ-14     Coulomb-14
     7.62333e+03    4.07026e+02    3.86990e+03 5.24986e+03    1.62035e+04
         LJ (SR)   Coulomb (SR)   Coul. recip. Potential    Kinetic En.
     8.67999e+04   -6.45748e+05    1.69649e+03 -5.23898e+05    9.71380e+04
    Total Energy    Temperature Pressure (bar)   Constr. rmsd
    -4.26760e+05    2.98129e+02    6.34793e+01 1.81889e-04

step 1040: timed with pme grid 44 48 40, coulomb cutoff 1.259: 5147.3 M-cycles
step 1120: timed with pme grid 42 48 40, coulomb cutoff 1.319: 5222.1 M-cycles
step 1200: timed with pme grid 42 44 40, coulomb cutoff 1.354: 5265.5 M-cycles
step 1280: timed with pme grid 40 44 40, coulomb cutoff 1.385: 5323.8 M-cycles
step 1360: timed with pme grid 40 44 36, coulomb cutoff 1.399: 5330.7 M-cycles
step 1440: timed with pme grid 40 42 36, coulomb cutoff 1.418: 5362.1 M-cycles
            Step           Time         Lambda
            1500        3.00000        0.00000

    Energies (kJ/mol)
           Angle    Proper Dih. Ryckaert-Bell. LJ-14     Coulomb-14
     7.48854e+03    4.01389e+02    3.82475e+03 5.30181e+03    1.62800e+04
         LJ (SR)   Coulomb (SR)   Coul. recip. Potential    Kinetic En.
     8.77397e+04   -6.47116e+05    1.04295e+03 -5.25037e+05    9.81473e+04
    Total Energy    Temperature Pressure (bar)   Constr. rmsd
    -4.26890e+05    3.01227e+02    1.97365e+02 1.61429e-04

step 1520: timed with pme grid 40 40 36, coulomb cutoff 1.489: 5479.8 M-cycles
step 1600: timed with pme grid 36 40 36, coulomb cutoff 1.539: 5561.7 M-cycles
step 1680: timed with pme grid 36 40 32, coulomb cutoff 1.574: 5621.6 M-cycles
step 1760: timed with pme grid 36 36 32, coulomb cutoff 1.655: 18446744079486.4 M-cycles
step 1840: timed with pme grid 32 36 32, coulomb cutoff 1.732: 5910.1 M-cycles
               optimal pme grid 44 48 40, coulomb cutoff 1.259
            Step           Time         Lambda
            2000        4.00000        0.00000

    Energies (kJ/mol)
           Angle    Proper Dih. Ryckaert-Bell. LJ-14     Coulomb-14
     7.15902e+03    4.73998e+02    3.80519e+03 5.28955e+03    1.60984e+04
         LJ (SR)   Coulomb (SR)   Coul. recip. Potential    Kinetic En.
     8.67239e+04   -6.46186e+05    1.73056e+03 -5.24905e+05    9.82208e+04
    Total Energy    Temperature Pressure (bar)   Constr. rmsd
    -4.26685e+05    3.01453e+02   -1.32508e+02 1.73835e-04

            Step           Time         Lambda
            2500        5.00000        0.00000

    Energies (kJ/mol)
           Angle    Proper Dih. Ryckaert-Bell. LJ-14     Coulomb-14
     7.43140e+03    4.52490e+02    3.53021e+03 5.30039e+03    1.62136e+04
         LJ (SR)   Coulomb (SR)   Coul. recip. Potential    Kinetic En.
     8.76721e+04   -6.46736e+05    1.70041e+03 -5.24435e+05    9.78989e+04
    Total Energy    Temperature Pressure (bar)   Constr. rmsd
    -4.26536e+05    3.00465e+02    5.56417e+01 1.73350e-04

            Step           Time         Lambda
            3000        6.00000        0.00000

[...]

    Energies (kJ/mol)
           Angle    Proper Dih. Ryckaert-Bell. LJ-14     Coulomb-14
     7.11055e+03    4.25118e+02    3.46453e+03 5.28507e+03    1.64961e+04
         LJ (SR)   Coulomb (SR)   Coul. recip. Potential    Kinetic En.
     8.79810e+04   -6.49054e+05    1.71960e+03 -5.26572e+05    9.86032e+04
    Total Energy    Temperature Pressure (bar)   Constr. rmsd
    -4.27969e+05    3.02626e+02    1.38984e+01 1.74222e-04

            Step           Time         Lambda
          500000     1000.00000        0.00000

Writing checkpoint, step 500000 at Tue Apr  8 16:24:42 2014

    Energies (kJ/mol)
           Angle    Proper Dih. Ryckaert-Bell. LJ-14     Coulomb-14
     7.13014e+03    4.02446e+02    3.53093e+03 5.31865e+03    1.64910e+04
         LJ (SR)   Coulomb (SR)   Coul. recip. Potential    Kinetic En.
     8.59428e+04   -6.46111e+05    1.64510e+03 -5.25650e+05    9.77048e+04
    Total Energy    Temperature Pressure (bar)   Constr. rmsd
    -4.27945e+05    2.99869e+02   -1.28891e+02 1.61440e-04

     <======  ###############  ==>
     <====  A V E R A G E S  ====>
     <==  ###############  ======>

     Statistics over 500001 steps using 5001 frames

    Energies (kJ/mol)
           Angle    Proper Dih. Ryckaert-Bell. LJ-14     Coulomb-14
     7.24325e+03    4.20614e+02    3.45246e+03 5.28335e+03    1.63563e+04
         LJ (SR)   Coulomb (SR)   Coul. recip. Potential    Kinetic En.
     8.70652e+04   -6.47075e+05    1.68233e+03 -5.25571e+05    9.77395e+04
    Total Energy    Temperature Pressure (bar)   Constr. rmsd
    -4.27832e+05    2.99975e+02    2.65498e+00 0.00000e+00

           Box-X          Box-Y          Box-Z
     7.43280e+00    7.99132e+00    6.75655e+00

    Total Virial (kJ/mol)
     3.24737e+04   -2.66392e+01   -1.46523e+01
    -2.63966e+01    3.25516e+04    6.88809e+01
    -1.42370e+01    6.90065e+01    3.26181e+04

    Pressure (bar)
     6.32622e+00    1.69758e+00   -7.49524e-01
     1.67749e+00    1.75993e+00   -3.96960e+00
    -7.83905e-01   -3.98004e+00   -1.21194e-01

   Epot (kJ/mol)        Coul-SR          LJ-SR Coul-14          LJ-14
Protein-Protein   -6.47075e+05    8.70652e+04 1.63313e+04    5.22322e+03
     Protein-SOL    0.00000e+00    0.00000e+00 0.00000e+00    0.00000e+00
     Protein-UNK    0.00000e+00    0.00000e+00 0.00000e+00    0.00000e+00
     Protein-Ion    0.00000e+00    0.00000e+00 0.00000e+00    0.00000e+00
         SOL-SOL    0.00000e+00    0.00000e+00 0.00000e+00    0.00000e+00
         SOL-UNK    0.00000e+00    0.00000e+00 0.00000e+00    0.00000e+00
         SOL-Ion    0.00000e+00    0.00000e+00 0.00000e+00    0.00000e+00
         UNK-UNK    0.00000e+00    0.00000e+00 2.49760e+01    6.01314e+01
         UNK-Ion    0.00000e+00    0.00000e+00 0.00000e+00    0.00000e+00
         Ion-Ion    0.00000e+00    0.00000e+00 0.00000e+00    0.00000e+00

       T-Protein          T-SOL          T-UNK T-Ion
     2.99940e+02    2.99980e+02    2.99534e+02 2.97557e+02

        P P   -   P M E   L O A D   B A L A N C I N G

  PP/PME load balancing changed the cut-off and PME settings:
            particle-particle                    PME
             rcoulomb  rlist            grid spacing   1/beta
    initial  0.900 nm  0.999 nm      64  72  56   0.120 nm  0.288 nm
    final    1.259 nm  1.358 nm      44  48  40   0.168 nm  0.403 nm
  cost-ratio           2.51             0.33
  (note that these numbers concern only part of the total PP and PME load)

     M E G A - F L O P S   A C C O U N T I N G

  NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
  RF=Reaction-Field  VdW=Van der Waals QSTab=quadratic-spline table
  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
  V&F=Potential and force  V=Potential only  F=Force only

  Computing: M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
  Pair Search distance check           84910.270848 764192.438     0.1
  NxN QSTab Elec. + VdW [F]         25223069.779264 1034145860.950    95.9
  NxN QSTab Elec. + VdW [V&F]         254850.190208 15036161.222     1.4
  1,4 nonbonded interactions            4704.009408 423360.847     0.0
  Calc Weights                         58819.617639 2117506.235     0.2
  Spread Q Bspline                   1254818.509632 2509637.019     0.2
  Gather F Bspline                   1254818.509632 7528911.058     0.7
  3D-FFT                             1382948.519652 11063588.157     1.0
  Solve PME                             1055.989952 67583.357     0.0
  Shift-X                                490.201713 2941.210     0.0
  Angles                                3264.006528 548353.097     0.1
  Propers                                377.500755 86447.673     0.0
  RB-Dihedrals                          3720.507441 918965.338     0.1
  Virial                                3925.839258 70665.107     0.0
  Stop-CM                                196.143426 1961.434     0.0
  P-Coupling                           19606.539213 117639.235     0.0
  Calc-Ekin                             7842.678426 211752.318     0.0
  Lincs                                 1818.003636 109080.218     0.0
  Lincs-Mat                            39168.078336 156672.313     0.0
  Constraint-V                         21442.542885 171540.343     0.0
  Constraint-Vir                        3924.939249 94198.542     0.0
  Settle                                5935.511871 1917170.334     0.2
  Virtual Site 2                           2.400008 55.200     0.0
-----------------------------------------------------------------------------
  Total 1078064243.645   100.0
-----------------------------------------------------------------------------

      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

  Computing:         Nodes   Th.     Count  Wall t (s) G-Cycles       %
-----------------------------------------------------------------------------
  Vsite constr.          1    1     500001 4.359       12.641     0.0
  Neighbor search        1    1      12501     629.074 1824.467     2.7
  Launch GPU ops.        1    1     500001 100.561      291.650     0.4
  Force                  1    1     500001    2266.241 6572.645     9.7
  PME mesh               1    1     500001   15297.280 44365.799    65.8
  Wait GPU local         1    1     500001 164.817      478.010     0.7
  NB X/F buffer ops.     1    1     987501     798.856 2316.876     3.4
  Vsite spread           1    1     600002 5.753       16.684     0.0
  Write traj.            1    1       1023 6.518       18.904     0.0
  Update                 1    1     500001     557.942 1618.167     2.4
  Constraints            1    1     500001    2758.223 7999.512    11.9
  Rest                   1                     665.131 1929.040     2.9
-----------------------------------------------------------------------------
  Total                  1                   23254.756 67444.397   100.0
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
  PME spread/gather      1    1    1000002   13321.576 38635.780    57.3
  PME 3D-FFT             1    1    1000002    1358.828 3940.929     5.8
  PME solve              1    1     500001     611.302 1772.923     2.6
-----------------------------------------------------------------------------

  GPU timings
-----------------------------------------------------------------------------
  Computing:                         Count  Wall t (s)      ms/step       %
-----------------------------------------------------------------------------
  Pair list H2D                      12501 13.744        1.099     0.2
  X / q H2D                         500001 167.017        0.334     2.4
  Nonbonded F kernel                400000 4858.062       12.145    69.5
  Nonbonded F+ene k.                 87500 1360.304       15.546    19.5
  Nonbonded F+ene+prune k.           12501 200.021       16.000     2.9
  F D2H                             500001 389.379        0.779     5.6
-----------------------------------------------------------------------------
  Total 6988.527       13.977   100.0
-----------------------------------------------------------------------------

Force evaluation time GPU/CPU: 13.977 ms/35.127 ms = 0.398
For optimal performance this ratio should be close to 1!

NOTE: The GPU has >25% less load than the CPU. This imbalance causes
       performance loss.

                Core t (s)   Wall t (s)        (%)
        Time:    23228.620    23254.756       99.9
                          6h27:34
                  (ns/day)    (hour/ns)
Performance:        3.715        6.460
Finished mdrun on node 0 Tue Apr  8 16:24:42 2014
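
As a rough cross-check of the timing summary above, the GPU/CPU ratio and the
performance figures can be reproduced from the wall times printed in the log.
This is only a sketch under one assumption: that the CPU force time is the sum
of the "Force" and "PME mesh" rows. That sum matches the 35.127 ms in the log,
but it is inferred from the numbers, not taken from the GROMACS documentation.
All variable names are illustrative.

```python
# Values copied from the log above.
steps = 500001            # number of steps in the cycle accounting
wall_total_s = 23254.756  # "Total" wall time
force_s = 2266.241        # "Force" row (CPU)
pme_mesh_s = 15297.280    # "PME mesh" row (CPU)
gpu_total_s = 6988.527    # "Total" of the GPU timings table
nsteps = 500000           # nsteps from the input parameters
delta_t_ps = 0.002        # delta-t from the input parameters

# Per-step force evaluation time on each device, in milliseconds.
gpu_ms = gpu_total_s / steps * 1000.0
# Assumption: CPU force time = Force + PME mesh (consistent with the log).
cpu_ms = (force_s + pme_mesh_s) / steps * 1000.0
ratio = gpu_ms / cpu_ms

# Share of the total wall time spent in PME mesh on the CPU.
pme_share_pct = pme_mesh_s / wall_total_s * 100.0

# Throughput: simulated time divided by wall time.
simulated_ns = nsteps * delta_t_ps / 1000.0
ns_per_day = simulated_ns * 86400.0 / wall_total_s
hours_per_ns = wall_total_s / 3600.0 / simulated_ns

print(f"GPU {gpu_ms:.3f} ms / CPU {cpu_ms:.3f} ms = {ratio:.3f}")
print(f"PME mesh share of wall time: {pme_share_pct:.1f} %")
print(f"Performance: {ns_per_day:.3f} ns/day, {hours_per_ns:.3f} hour/ns")
```

Read this way, the GPU finishes its non-bonded work in about 40% of the time
the single CPU thread needs for its share, and PME mesh alone accounts for
roughly two thirds of the wall time.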

On 07/04/2014 17:32,  Szilárd Páll <pall.szilard at gmail.com> wrote:
> Please post a log file, that would help with giving you more concrete
> advice. My guess is that you're running reaction-field (otherwise you
> must have turned off PP-PME load balancing), but I'll comment more when
> I see a log file.
>
> Cheers,
> --
> Szilárd
>
>
> On Mon, Apr 7, 2014 at 4:33 PM, Dario Corrada <dario.corrada at gmail.com> wrote:
>> I have a machine with AMD Athlon 64 dual core with an NVIDIA GeForce GTX
>> 480.
>>
>> To optimize performance, I'd like to offload as much of the calculation to
>> the GPU as possible.
>>
>> I tried mdrun -nt 1 -nb gpu ..., but I got the following message:
>>
>> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>> performance loss.
>>
>> How can I improve mdrun performance?
>>