[gmx-users] GTX 960 vs Tesla K40
Alex
nedomacho at gmail.com
Mon Jun 18 23:35:04 CEST 2018
Persistence is enabled so I don't have to overclock again. To be honest, I
am still not entirely comfortable with the notion of ranks, even after reading
the acceleration documentation several times. Parts of the log file are below,
and I will obviously appreciate suggestions/clarifications:
Command line:
gmx mdrun -nt 4 -ntmpi 2 -npme 1 -pme gpu -nb gpu -s run_unstretch.tpr -o
traj_unstretch.trr -g md.log -c unstretched.gro
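
A rough reading of that rank/thread layout, added here as an annotation rather
than mdrun output, with flag meanings as documented for GROMACS 2018 mdrun:

    # -nt 4     : 4 threads in total
    # -ntmpi 2  : split into 2 thread-MPI ranks
    # -npme 1   : dedicate 1 of those ranks to PME
    # => 1 PP rank + 1 PME rank, 2 OpenMP threads each, which is what the
    #    cycle accounting further down reports.
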
GROMACS version: 2018
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: SSE4.1
FFT library: fftw-3.3.5-sse2
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
Built on: 2018-02-13 19:43:29
Built by: smolyan at MINTbox [CMAKE]
Build OS/arch: Linux 4.4.0-112-generic x86_64
Build CPU vendor: Intel
Build CPU brand: Intel(R) Xeon(R) CPU W3530 @ 2.80GHz
Build CPU family: 6 Model: 26 Stepping: 5
Build CPU features: apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
nonstop_tsc pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
C compiler: /usr/bin/cc GNU 5.4.0
C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 5.4.0
C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
driver;Copyright (c) 2005-2017 NVIDIA Corporation;Built on
Fri_Nov__3_21:07:56_CDT_2017;Cuda compilation tools, release 9.1, V9.1.85
CUDA compiler
flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;;
;-msse4.1;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 9.10
CUDA runtime: 9.10
Running on 1 node with total 4 cores, 4 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU W3530 @ 2.80GHz
Family: 6 Model: 26 Stepping: 5
Features: apic clfsh cmov cx8 cx16 htt intel lahf mmx msr nonstop_tsc
pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0]
Socket 1: [ 1]
Socket 2: [ 2]
Socket 3: [ 3]
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Tesla K40c, compute cap.: 3.5, ECC: no, stat: compatible
................
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 547029.956656 4923269.610 0.0
NxN Ewald Elec. + LJ [F] 485658021.416832 32053429413.511 98.0
NxN Ewald Elec. + LJ [V&F] 4905656.839680 524905281.846 1.6
1,4 nonbonded interactions 140625.005625 12656250.506 0.0
Reset In Box 4599.000000 13797.000 0.0
CG-CoM 4599.018396 13797.055 0.0
Bonds 48000.001920 2832000.113 0.0
Angles 94650.003786 15901200.636 0.0
RB-Dihedrals 186600.007464 46090201.844 0.1
Pos. Restr. 2600.000104 130000.005 0.0
Virial 4610.268441 82984.832 0.0
Stop-CM 91.998396 919.984 0.0
Calc-Ekin 45990.036792 1241730.993 0.0
Constraint-V 318975.012759 2551800.102 0.0
Constraint-Vir 3189.762759 76554.306 0.0
Settle 106325.004253 34342976.374 0.1
Virtual Site 3 107388.258506 3973365.565 0.0
-----------------------------------------------------------------------------
Total 32703165544.282 100.0
-----------------------------------------------------------------------------
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 0.0
av. #atoms communicated per step for vsites: 3 x 0.0
av. #atoms communicated per step for LINCS: 2 x 0.0
Average PME mesh/force load: 1.193
Part of the total run time spent waiting due to PP/PME imbalance: 5.1 %
NOTE: 5.1 % performance was lost because the PME ranks
had more work to do than the PP ranks.
You might want to increase the number of PME ranks
or increase the cut-off and the grid spacing.
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank doing PP, using 2 OpenMP threads, and
on 1 MPI rank doing PME, using 2 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Domain decomp. 1 2 250000 975.157 5461.106 1.0
DD comm. load 1 2 25002 0.009 0.053 0.0
Vsite constr. 1 2 25000001 2997.638 16787.470 3.1
Send X to PME 1 2 25000001 806.884 4518.740 0.8
Neighbor search 1 2 250001 1351.275 7567.455 1.4
Launch GPU ops. 1 2 50000002 7767.373 43499.093 8.0
Comm. coord. 1 2 24750000 4.359 24.410 0.0
Force 1 2 25000001 8994.482 50371.185 9.3
Wait + Comm. F 1 2 25000001 3.992 22.355 0.0
PME mesh * 1 2 25000001 30757.016 172246.434 31.7
PME wait for PP * 17821.979 99807.221 18.3
Wait + Recv. PME F 1 2 25000001 3355.753 18792.998 3.5
Wait PME GPU gather 1 2 25000001 25539.917 143029.467 26.3
Wait GPU NB nonloc. 1 2 25000001 61.503 344.432 0.1
Wait GPU NB local 1 2 25000001 15384.720 86158.005 15.8
NB X/F buffer ops. 1 2 99500002 1817.951 10180.950 1.9
Vsite spread 1 2 25250002 3417.205 19137.139 3.5
Write traj. 1 2 2554 18.100 101.362 0.0
Update 1 2 25000001 1832.047 10259.890 1.9
Constraints 1 2 25000001 3232.961 18105.330 3.3
Comm. energies 1 2 1250001 5.858 32.805 0.0
-----------------------------------------------------------------------------
Total 48578.997 544107.322 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 194315.986 48578.997 400.0
13h29:38
(ns/day) (hour/ns)
Performance: 88.927 0.270
Finished mdrun on rank 0 Mon Jun 18 07:42:59 2018
On Mon, Jun 18, 2018 at 3:23 PM, Szilárd Páll <pall.szilard at gmail.com>
wrote:
> On Mon, Jun 18, 2018 at 2:22 AM, Alex <nedomacho at gmail.com> wrote:
>
> > Thanks for the heads up. With the K40c instead of GTX 960 here's what I
> > did and here are the results:
> >
> > 1. Enabled persistence mode and overclocked the card via nvidia-smi:
> > http://acceleware.com/blog/gpu-boost-nvidias-tesla-k40-gpus
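
[For reference, the relevant commands are roughly the ones below; this is a
sketch rather than output from this box, and the clock pair shown is the
commonly cited K40 boost setting, so check what your card actually supports.]

    # query the supported memory,graphics clock pairs
    nvidia-smi -q -d SUPPORTED_CLOCKS
    # enable persistence mode (kept until reboot or until disabled)
    sudo nvidia-smi -pm 1
    # set application clocks to memory,graphics in MHz (example K40 boost pair)
    sudo nvidia-smi -ac 3004,875
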
>
>
> Note that persistence mode is only for convenience.
>
>
> > 2. Offloaded PME's FFT to the GPU (which wasn't the case with the GTX 960);
> > this brought the "pme mesh/force" ratio to something like 1.07.
> >
>
> I still think you are running multiple ranks, which is unlikely to be ideal,
> but without seeing a log file it's hard to tell.
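
[For comparison, the single-rank, OpenMP-only layout being suggested here would
look roughly like the line below; the flags are standard GROMACS 2018 mdrun
options and the file names are copied from the command at the top of this mail.]

    gmx mdrun -ntmpi 1 -ntomp 4 -nb gpu -pme gpu -pmefft gpu \
        -s run_unstretch.tpr -o traj_unstretch.trr -g md.log -c unstretched.gro
    # 1 thread-MPI rank with 4 OpenMP threads; nonbonded and the full PME task
    # are offloaded to the single GPU, so there is no separate PME rank and no
    # PP/PME load balancing to worry about.
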
>
> > The result is a solid increase in performance on a small-ish system (20K
> > atoms): 90 ns/day instead of 65-70. I don't use this box for anything
> > except prototyping, but still the swap + tweaks were pretty useful.
>
>
> >
> > Alex
> >
> >
> >
> > On 6/15/2018 1:20 PM, Szilárd Páll wrote:
> >
> >> Hi,
> >>
> >> Regarding the K40 vs GTX 960 question, the K40 will likely be a bit
> >> faster (though it'll consume more power, if that matters). The
> >> difference will be at most 20% in total performance, I think -- and
> >> with small systems likely negligible (as a smaller card with higher
> >> clocks is more efficient at small tasks than a large card with lower
> >> clocks).
> >>
> >> Regarding the load balance note, you are correct, the "pme mesh/force"
> >> means the ratio of time spent in computing PME forces on a separate
> >> task/rank and the rest of the forces (including nonbonded, bonded,
> >> etc.). With GPU offload this is a bit more tricky as the observed time
> >> is the time spent waiting for the GPU results, but the take-away is
> >> the same: when a run shows "pme mesh/force" far from 1, there is
> >> imbalance affecting performance.
> >>
> >> However, note that with a single GPU I've yet to see a case where you
> >> get better performance by running multiple ranks rather than simply
> >> running OpenMP-only. Also note that what counts as a "weak GPU" varies
> >> case-by-case, so I recommend taking the 1-2 minutes to do a short run
> >> and check whether, for a given hardware + simulation setup, it is better
> >> to offload all of PME or to keep the FFTs on the CPU.
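
[A sketch of such a quick test, using the tpr from the top of this thread; the
benchmark-related flags (-nsteps, -resethway, -noconfout) are standard mdrun
options, and the log file names here are just illustrative.]

    # variant 1: full PME offload, FFTs on the GPU
    gmx mdrun -ntmpi 1 -ntomp 4 -nb gpu -pme gpu -pmefft gpu \
        -s run_unstretch.tpr -nsteps 20000 -resethway -noconfout -g pme_fft_gpu.log
    # variant 2: PME offloaded, but the FFT part kept on the CPU (mixed mode)
    gmx mdrun -ntmpi 1 -ntomp 4 -nb gpu -pme gpu -pmefft cpu \
        -s run_unstretch.tpr -nsteps 20000 -resethway -noconfout -g pme_fft_cpu.log
    # then compare the ns/day reported at the end of the two log files
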
> >>
> >> We'll do our best to automate more of these choices, but for now if
> >> you care about performance it's useful to test before doing long runs.
> >>
> >> Cheers,
> >> --
> >> Szilárd
> >>
> >>
> >> On Thu, Jun 14, 2018 at 2:09 AM, Alex <nedomacho at gmail.com> wrote:
> >>
> >>> Question: in the DD output (md.log) that looks like "DD step xxxxxx pme
> >>> mesh/force 1.229," what is the ratio? Does it mean the pme calculations
> >>> take longer by the shown factor than the nonbonded interactions?
> >>> With GTX 960, the ratio is consistently ~0.85, with Tesla K40 it's ~1.25.
> >>> My mdrun line contains -pmefft cpu (per Szilard's advice for weak GPUs,
> >>> I believe). Would it then make sense to offload the fft to the K40?
> >>>
> >>> Thank you,
> >>>
> >>> Alex
> >>>
> >>> On Wed, Jun 13, 2018 at 4:53 PM, Alex <nedomacho at gmail.com> wrote:
> >>>
> >>> So, swap, then? Thank you!
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Jun 13, 2018 at 4:49 PM, paul buscemi <pbuscemi at q.com> wrote:
> >>>>
> >>>> flops trumps clock speed…..
> >>>>>
> >>>>> On Jun 13, 2018, at 3:45 PM, Alex <nedomacho at gmail.com> wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I have an old "prototyping" box with a 4-core Xeon and an old GTX 960.
> >>>>>> We have a Tesla K40 lying around and there's only one PCIe slot
> >>>>>> available in this machine. Would it make sense to swap the cards, or is
> >>>>>> it already bottlenecked by the CPU? I compared the specs and the 960 has
> >>>>>> a higher clock speed, while the K40's FP performance is better. Should I
> >>>>>> swap the GPUs?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Alex