[gmx-users] Nonsense timing accounting with OpenCL in Haswell
Schulz, Roland
roland.schulz at intel.com
Thu Jan 24 10:27:17 CET 2019
Hi Elton,
It is very unlikely that you would get any speedup with OpenCL on that CPU. For its generation the CPU is powerful, but the GPU is a lower tier. To get a meaningful speedup compared to running on the i7 CPU alone, you want a GT3 or higher GPU with at least 48 EUs; yours has 20 EUs.
The nonsensical timings might be caused by beignet; we haven't tested beignet with GROMACS.
Other than for GPU/CPU load balancing (which you could do manually), the timings shouldn't affect performance. So you can try runs with and without the GPU to determine the speedup or slowdown from using it.
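In case it helps, a sketch of such a comparison (the `-nb` flag forces where the nonbonded kernels run; `topol.tpr` and the `-deffnm` names are placeholders for your own files):

```shell
# Force the nonbonded kernels onto the CPU only:
gmx mdrun -nb cpu -s topol.tpr -deffnm cpu_run -maxh .5 -notunepme
# Force the nonbonded kernels onto the GPU:
gmx mdrun -nb gpu -s topol.tpr -deffnm gpu_run -maxh .5 -notunepme
# Compare the "Performance:" (ns/day) lines at the end of each log:
grep -A1 'Performance:' cpu_run.log gpu_run.log
```

The ns/day numbers come from the wall clock, not the (possibly broken) GPU event timers, so they should be trustworthy even with beignet.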
If you are interested in finding out what exactly goes wrong with the timing, you might want to look at getLastRangeTime in src/gromacs/gpu_utils/gpuregiontimer_ocl.h. Maybe add some extra debug printing to see whether only a few timings are abnormal or whether all of them are nonsensical.
Roland
From: Elton Carvalho <eltonfc at gmail.com>
Date: Thu, 24 Jan 2019 at 04:06
Subject: [gmx-users] Nonsense timing accounting with OpenCL in Haswell
To: Discussion list for GROMACS users <gmx-users at gromacs.org>
Greetings!
I'm trying to set up gromacs-2019 to use OpenCL with my Intel GPU,
integrated in a Haswell processor. The log file says it's detected as
Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel, device version:
OpenCL 1.2 beignet 1.3, stat: compatible
I'm running beignet as the OpenCL driver because the NEO drivers don't seem
to support Haswell.
Testing with the "NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water" benchmark,
as available in the "adh_cubic_vsites" directory, I get *MUCH* slower
performance and some really weird timings at the end of the logfile. Things
like:
1) "Launch GPU ops." taking almost 90% of the run time
2) "Nonbonded F kernel" in the GPU times with nonsense readings such as "
5589922469 ms/step" in a 30-minute test run.
3) 110680465055.927 seconds of Total walltime in the GPU in a 30-minute
real-time run.
My questions are:
a) Could this nonsense timing be coming from beignet, which is not _really_
supported? If not, where else could it come from?
b) How can I troubleshoot this and get sensible timings to decide whether
using OpenCL on this machine is even worth it?
c) Is Intel OpenCL even worth it on a Haswell machine (i7-4790)? :)
The whole log is available at
https://gist.github.com/eltonfc/dd8755bce756b627464df70faa9d3bab and the
relevant parts are below:
[ LOG FILE BEGINS ]
Command line:
gmx mdrun -v -maxh .5 -notunepme
GROMACS version: 2019
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: OpenCL
SIMD instructions: AVX2_256
FFT library: fftw-3.3.6-pl2-fma-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.11.2
Tracing support: disabled
C compiler: /usr/bin/cc GNU 7.3.0
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
OpenCL include dir: /usr/include
OpenCL library: /usr/lib/libOpenCL.so
OpenCL version: 2.0
Running on 1 node with total 4 cores, 8 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Family: 6 Model: 60 Stepping: 3
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel
lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Full, with devices
Sockets, cores, and logical processors:
Socket 0: [ 0 4] [ 1 5] [ 2 6] [ 3 7]
Numa nodes:
Node 0 (16704245760 bytes mem): 0 1 2 3 4 5 6 7
Latency:
0
0 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
L3: 8388608 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
PCI devices:
0000:00:02.0 Id: 8086:0412 Class: 0x0300 Numa: 0
0000:00:19.0 Id: 8086:153a Class: 0x0200 Numa: 0
0000:00:1f.2 Id: 8086:8c02 Class: 0x0106 Numa: 0
GPU info:
Number of GPUs detected: 1
#0: name: Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel,
device version: OpenCL 1.2 beignet 1.3, stat: compatible
[... skipping ... ]
Changing rlist from 0.935 to 0.956 for non-bonded 4x2 atom kernels
Changing nstlist from 10 to 40, rlist from 0.956 to 1.094
Using 1 MPI thread
Using 8 OpenMP threads
1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
PP:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.
[... skipping ...]
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 4723.039760 42507.358 0.1
NxN Ewald Elec. + LJ [F] 793216.982880 52352320.870 90.7
NxN Ewald Elec. + LJ [V&F] 8092.108144 865855.571 1.5
1,4 nonbonded interactions 562.216216 50599.459 0.1
Calc Weights 4087.128672 147136.632 0.3
Spread Q Bspline 87192.078336 174384.157 0.3
Gather F Bspline 87192.078336 523152.470 0.9
3D-FFT 398671.243138 3189369.945 5.5
Solve PME 100.010000 6400.640 0.0
Shift-X 34.192224 205.153 0.0
Angles 325.472544 54679.387 0.1
Propers 548.454840 125596.158 0.2
Impropers 40.564056 8437.324 0.0
Virial 13.763169 247.737 0.0
Stop-CM 13.894848 138.948 0.0
Calc-Ekin 272.720448 7363.452 0.0
Lincs 131.439420 7886.365 0.0
Lincs-Mat 3692.627456 14770.510 0.0
Constraint-V 1391.078160 11128.625 0.0
Constraint-Vir 12.719940 305.279 0.0
Settle 376.112800 121484.434 0.2
Virtual Site 3 21.012160 777.450 0.0
Virtual Site 3fd 19.880736 1888.670 0.0
Virtual Site 3fad 3.313456 583.168 0.0
Virtual Site 3out 57.217728 4977.942 0.0
Virtual Site 4fdn 16.486464 4187.562 0.0
-----------------------------------------------------------------------------
Total 57716385.269 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 8 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Vsite constr. 1 8 10001 1.000 28.733 0.1
Neighbor search 1 8 251 5.410 155.445 0.4
Launch GPU ops. 1 8 10001 1353.401 38888.866 89.6
Force 1 8 10001 6.941 199.443 0.5
PME mesh 1 8 10001 121.314 3485.853 8.0
Wait GPU NB local 1 8 10001 0.425 12.220 0.0
NB X/F buffer ops. 1 8 19751 6.577 188.988 0.4
Vsite spread 1 8 10102 1.572 45.169 0.1
Write traj. 1 8 2 0.505 14.506 0.0
Update 1 8 10001 5.537 159.111 0.4
Constraints 1 8 10003 6.719 193.054 0.4
Rest 0.691 19.863 0.0
-----------------------------------------------------------------------------
Total 1510.092 43391.251 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME spread 1 8 10001 40.884 1174.769 2.7
PME gather 1 8 10001 29.190 838.744 1.9
PME 3D-FFT 1 8 20002 48.085 1381.693 3.2
PME solve Elec 1 8 10001 3.060 87.920 0.2
-----------------------------------------------------------------------------
GPU timings
-----------------------------------------------------------------------------
Computing: Count Wall t (s) ms/step %
-----------------------------------------------------------------------------
Pair list H2D 251 0.377 1.500 0.0
X / q H2D 10001 18.341 1.834 0.0
Pruning kernel            9900 55340232448.731    5589922469      50.0
Nonbonded F+ene k. 101 62.366 617.484 0.0
Pruning kernel 251 6.736 26.836 0.0
F D2H                    10001 55340232519.377    5533469904      50.0
-----------------------------------------------------------------------------
Total                          110680465055.927   1106693981     100.0
-----------------------------------------------------------------------------
*Dynamic pruning 4750 25.872 5.447 0.0
-----------------------------------------------------------------------------
Average per-step force GPU/CPU evaluation time ratio: 11066939811.612 ms / 12.824 ms = 862973673.837
For optimal resource utilization this ratio should be close to 1
NOTE: The GPU has >25% more load than the CPU. This imbalance wastes
CPU resources.
Core t (s) Wall t (s) (%)
Time: 12080.732 1510.092 800.0
(ns/day) (hour/ns)
Performance: 2.861 8.389
[ LOG FILE ENDS ]
Cheers from an unbearably hot São Paulo,
--
Elton Carvalho
--
Gromacs Users mailing list
* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.
More information about the gromacs.org_gmx-users mailing list