[gmx-users] cpu gpu performance

Mark Abraham mark.j.abraham at gmail.com
Mon Jan 5 00:53:56 CET 2015


On Sun, Jan 4, 2015 at 5:41 PM, <h.alizadeh at znu.ac.ir> wrote:

> Dear Users,
> I'm simulating a membrane protein system with approximately 185000 atoms
> with an  Intel Corei7 cpu.
> I have two questions:
> 1. Performance of my simulations is about 1.8ns/day. Is this performance
> normal for such a system? Or my simulations are suffering from lack of
> performance?
>

The actual performance depends on everything, of course, but this number is
believable.


> 2. When I use the mdrun command with -nb gpu, the performance drops to
> 1.3 ns/day! How can I resolve this problem?
>

mdrun does a simple offload of all the short-ranged non-bonded work to the
GPU. If the GPU is slow relative to the CPU, then that can be a net loss.
Alternatively, this system could be too large for efficient use of older
GPUs - I don't know what the expected behaviour would be.
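One way to find out which resource is the bottleneck is to benchmark short runs with and without GPU offload and compare the ns/day figures. A minimal sketch (file names here are placeholders; -resethway excludes startup and load-balancing cost from the timings):

```shell
# Benchmark sketch: same .tpr, CPU-only vs. GPU offload.
# "bench" is a placeholder -deffnm; adjust -nsteps to taste.
gmx mdrun -deffnm bench_cpu -nb cpu -nsteps 10000 -resethway
gmx mdrun -deffnm bench_gpu -nb gpu -nsteps 10000 -resethway
# Then compare the "Performance:" (ns/day) lines at the end of
# bench_cpu.log and bench_gpu.log.
```

If the CPU-only run is faster, the offload is a net loss on this hardware and you should simply run with -nb cpu.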

> My .mdp file parameters are:
> integrator              = md
> dt                      = 0.002
> nsteps                  = 15000000
> nstlog                  = 1000
> nstxout                 = 5000
> nstvout                 = 5000
> nstfout                 = 5000
> nstcalcenergy           = 100
> nstenergy               = 1000
> nstxtcout               = 2000      ; compressed xtc trajectory output every 4 ps
> ;
> cutoff-scheme           = Verlet
> nstlist                 = 20
> rlist                   = 1.0
> coulombtype             = pme
> rcoulomb                = 1.0
> vdwtype                 = Cut-off
> vdw-modifier            = Force-switch
> rvdw_switch             = 0.9
> rvdw                    = 1.0
>

I hope you know why you're using this particular combination of non-bonded
settings. In particular, using force-switch requires the short-range
kernels to use table lookups, which tend to be slower than the
implementations of the alternative modifiers.
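If the force field does not mandate force switching (CHARMM, for example, does require it), a potential-shift modifier lets mdrun use the faster analytical kernels. A hedged sketch of the alternative .mdp settings:

```
; Alternative van der Waals treatment - only valid if your force
; field was parametrized for a plain cut-off:
vdwtype      = Cut-off
vdw-modifier = Potential-shift-Verlet
rvdw         = 1.0
```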


> ;
> tcoupl                  = berendsen
>

Side point - there are known problems with using the Berendsen thermostat
for production simulations: it does not sample a correct canonical ensemble.
Use something else.
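For example, the velocity-rescaling thermostat is a drop-in replacement that does sample the correct ensemble (group names below copied from your .mdp):

```
tcoupl  = V-rescale
tc_grps = PROT   NPROT   SOL_ION
tau_t   = 1.0    1.0     1.0
ref_t   = 303.15 303.15  303.15
```

The same caveat applies to Berendsen pressure coupling; Parrinello-Rahman is the usual production choice there.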

> tc_grps                 = PROT   NPROT   SOL_ION
> tau_t                   = 1.0    1.0     1.0
> ref_t                   = 303.15   303.15   303.15
> ;
> pcoupl                  = berendsen
> pcoupltype              = semiisotropic
> tau_p                   = 5.0     5.0
> compressibility         = 4.5e-5  4.5e-5
> ref_p                   = 1.0     1.0
> ;
> ;
> constraints             = h-bonds
> constraint_algorithm    = LINCS
> continuation        = yes
> ;
> nstcomm                 = 100
> comm_mode               = linear
> comm_grps               = PROT   NPROT   SOL_ION
> ;
> refcoord_scaling        = com
> and at the end of the log file, when I use the GPU, I have:
>
> NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
>  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>  V&F=Potential and force  V=Potential only  F=Force only
>
>  Computing:                               M-Number         M-Flops  % Flops
>
> -----------------------------------------------------------------------------
>  NB VdW [V&F]                            65.721780          65.722     0.0
>  Pair Search distance check             354.095696        3186.861     0.1
>  NxN QSTab Elec. + LJ [F]             78361.108992     4153138.777    92.2
>

Here are the quadratic-spline tabulated kernels being flagged.


>  NxN QSTab Elec. + LJ [V&F]            1094.086656       88621.019     2.0
>  1,4 nonbonded interactions              92.366244        8312.962     0.2
>  Calc Weights                           273.463938        9844.702     0.2
>  Spread Q Bspline                      5833.897344       11667.795     0.3
>  Gather F Bspline                      5833.897344       35003.384     0.8
>  3D-FFT                               19866.277292      158930.218     3.5
>  Solve PME                                5.271904         337.402     0.0
>  Shift-X                                  2.625854          15.755     0.0
>  Bonds                                   14.647068         864.177     0.0
>  Propers                                106.938468       24488.909     0.5
>  Impropers                                1.961496         407.991     0.0
>  Virial                                   4.877756          87.800     0.0
>  Stop-CM                                  1.125366          11.254     0.0
>  Calc-Ekin                                9.753172         263.336     0.0
>  Lincs                                   20.162196        1209.732     0.0
>  Lincs-Mat                              129.913632         519.655     0.0
>  Constraint-V                            96.517170         772.137     0.0
>  Constraint-Vir                           4.084834          98.036     0.0
>  Settle                                  18.730926        6050.089     0.1
>  (null)                                   0.653184           0.000     0.0
>
> -----------------------------------------------------------------------------
>  Total                                                 4503897.712   100.0
>
> -----------------------------------------------------------------------------
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 1 MPI rank, each using 8 OpenMP threads
>
>  Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                      Ranks Threads  Count      (s)         total sum    %
>
> -----------------------------------------------------------------------------
>  Neighbor search        1    8         14       0.301          8.175   0.4
>  Launch GPU ops.        1    8        486       0.063          1.719   0.1
>  Force                  1    8        486       4.351        118.334   6.3
>  PME mesh               1    8        486       8.685        236.229  12.5
>  Wait GPU local         1    8        486      52.321       1423.144  75.5
>

and here's the CPU spending 75% of its time waiting for the GPU.


>  NB X/F buffer ops.     1    8        958       0.389         10.571   0.6
>  Write traj.            1    8          1       0.265          7.221   0.4
>  Update                 1    8        486       0.989         26.887   1.4
>  Constraints            1    8        486       1.041         28.308   1.5
>  Rest                                           0.915         24.895   1.3
>
> -----------------------------------------------------------------------------
>  Total                                         69.319       1885.482 100.0
>
> -----------------------------------------------------------------------------
>  Breakdown of PME mesh computation
>
> -----------------------------------------------------------------------------
>  PME spread/gather      1    8        972       5.574        151.608   8.0
>  PME 3D-FFT             1    8        972       2.862         77.836   4.1
>  PME solve Elec         1    8        486       0.216          5.880   0.3
>
> -----------------------------------------------------------------------------
>
>  GPU timings
>
> -----------------------------------------------------------------------------
>  Computing:                         Count  Wall t (s)      ms/step       %
>
> -----------------------------------------------------------------------------
>  Pair list H2D                         14       0.027        1.919     0.0
>  X / q H2D                            486       0.262        0.539     0.4
>  Nonbonded F kernel                   460      59.334      128.988    90.8
>

and the GPU is just taking a long time to get its work done.
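It may be worth confirming what device mdrun actually detected; an old or low-end GPU would explain these timings. A quick check (the log file name is a placeholder):

```shell
# The hardware detection section near the top of the log names the
# GPU and its compute capability:
grep -i "gpu" md.log | head
# nvidia-smi gives the driver-level view of the device (NVIDIA only):
nvidia-smi
```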

Mark

>  Nonbonded F+ene k.                    12       2.819      234.875     4.3
>  Nonbonded F+ene+prune k.              14       2.761      197.239     4.2
>  F D2H                                486       0.174        0.359     0.3
>
> -----------------------------------------------------------------------------
>  Total                                         65.378      134.522   100.0
>
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 134.522 ms/26.822 ms = 5.015
> For optimal performance this ratio should be close to 1!
> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>       performance loss, consider using a shorter cut-off and a finer PME
> grid.
>
>                Core t (s)   Wall t (s)        (%)
>        Time:      550.116       69.319      793.6
>                  (ns/day)    (hour/ns)
> Performance:        1.212       19.810
>
> Best,
> Hadi
>