[gmx-users] Nonsense timing accounting with OpenCL in Haswell
Elton Carvalho
eltonfc at gmail.com
Thu Jan 24 04:05:45 CET 2019
Greetings!
I'm trying to set up gromacs-2019 to use OpenCL with my Intel GPU,
integrated in a Haswell processor. The log file says it's detected as
Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel, device version:
OpenCL 1.2 beignet 1.3, stat: compatible
I'm running beignet as the OpenCL driver because the NEO drivers don't seem
to support Haswell.
Testing with the "NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water" benchmark,
available in the "adh_cubic_vsites" directory, I get *MUCH* slower
performance and some really weird timings at the end of the log file. Things
like:
1) "Launch GPU ops." taking almost 90% of the run time
2) "Nonbonded F kernel" in the GPU times with nonsense readings such as "
5589922469 ms/step" in a 30-minute test run.
3) 110680465055.927 seconds of total walltime on the GPU in a 30-minute
real-time run (see the quick scale check right below).
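To put items 2) and 3) in perspective: taking the reported ms/step at face
value, and assuming the "Nonbonded F kernel" call count of 9900 (the 10001
steps minus the 101 force+energy calls), the implied kernel time works out
to roughly 1750 years, against an actual wall time of about 25 minutes. A
trivial check of the scale:

#include <stdio.h>

int main(void)
{
    double ms_per_step = 5589922469.0; /* "Nonbonded F kernel" ms/step from the log */
    double steps       = 9900;         /* F kernel calls: 10001 - 101 F+ene calls */
    double wall_s      = 1510.092;     /* actual total wall time reported by mdrun */

    double implied_s = ms_per_step / 1000.0 * steps;
    printf("implied kernel time: %.3e s (~%.0f years)\n",
           implied_s, implied_s / (365.25 * 24 * 3600));
    printf("actual wall time:    %.1f s\n", wall_s);
    return 0;
}

That is a discrepancy of more than seven orders of magnitude, which looks
much more like broken or overflowing profiling counters than like a real
performance problem.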
My questions are:
a) Could this nonsense timing be coming from beignet, which is not _really_
supported? If not, where else could it come from?
b) How can I troubleshoot this and get sensible timings, so I can decide
whether using OpenCL on this machine is even worth it? (A standalone probe
sketch follows after these questions.)
c) Is Intel OpenCL even worth it on a Haswell machine (i7-4790)? :)
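Regarding b): as far as I understand, mdrun's OpenCL GPU timings come from
event profiling (clGetEventProfilingInfo), so a standalone probe should show
whether beignet's profiling timestamps are sane at all, independently of
GROMACS. A minimal sketch, with error handling mostly omitted (the
deprecated clCreateCommandQueue is still fine on an OpenCL 1.2 device, and
the library path may vary):

/* profile_check.c: OpenCL event-profiling sanity check.
 * Build: cc profile_check.c -lOpenCL -o profile_check
 */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id   dev;
    cl_int         err;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    /* Profiling must be enabled on the queue, as mdrun does. */
    cl_command_queue q =
        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);

    /* Time a trivial 1 MiB host->device transfer. */
    size_t n = 1 << 20;
    char *host = calloc(1, n);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n, NULL, &err);

    cl_event ev;
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n, host, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof t0, &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof t1, &t1, NULL);

    /* Timestamps are nanoseconds; a 1 MiB copy should take on the
     * order of microseconds to milliseconds, nothing astronomical. */
    printf("start=%llu ns  end=%llu ns  delta=%.3f ms\n",
           (unsigned long long)t0, (unsigned long long)t1,
           (t1 - t0) / 1e6);

    clReleaseEvent(ev);
    clReleaseMemObject(buf);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    free(host);
    return 0;
}

If this prints garbage deltas, the problem is in beignet's profiling
counters, not in GROMACS' bookkeeping. And if I'm reading the manual right,
rerunning with GMX_DISABLE_GPU_TIMING set should then at least show whether
the huge "Launch GPU ops." share is real overhead or just broken timing.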
The whole log is available at
https://gist.github.com/eltonfc/dd8755bce756b627464df70faa9d3bab and the
relevant parts are below:
[ LOG FILE BEGINS ]
Command line:
gmx mdrun -v -maxh .5 -notunepme
GROMACS version: 2019
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: OpenCL
SIMD instructions: AVX2_256
FFT library: fftw-3.3.6-pl2-fma-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.11.2
Tracing support: disabled
C compiler: /usr/bin/cc GNU 7.3.0
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
OpenCL include dir: /usr/include
OpenCL library: /usr/lib/libOpenCL.so
OpenCL version: 2.0
Running on 1 node with total 4 cores, 8 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Family: 6 Model: 60 Stepping: 3
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel
lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Full, with devices
Sockets, cores, and logical processors:
Socket 0: [ 0 4] [ 1 5] [ 2 6] [ 3 7]
Numa nodes:
Node 0 (16704245760 bytes mem): 0 1 2 3 4 5 6 7
Latency:
0
0 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
L3: 8388608 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
PCI devices:
0000:00:02.0 Id: 8086:0412 Class: 0x0300 Numa: 0
0000:00:19.0 Id: 8086:153a Class: 0x0200 Numa: 0
0000:00:1f.2 Id: 8086:8c02 Class: 0x0106 Numa: 0
GPU info:
Number of GPUs detected: 1
#0: name: Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel,
device version: OpenCL 1.2 beignet 1.3, stat: compatible
[... skipping ...]
Changing rlist from 0.935 to 0.956 for non-bonded 4x2 atom kernels
Changing nstlist from 10 to 40, rlist from 0.956 to 1.094
Using 1 MPI thread
Using 8 OpenMP threads
1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
PP:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.
[... skipping ...]
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 4723.039760 42507.358 0.1
NxN Ewald Elec. + LJ [F] 793216.982880 52352320.870 90.7
NxN Ewald Elec. + LJ [V&F] 8092.108144 865855.571 1.5
1,4 nonbonded interactions 562.216216 50599.459 0.1
Calc Weights 4087.128672 147136.632 0.3
Spread Q Bspline 87192.078336 174384.157 0.3
Gather F Bspline 87192.078336 523152.470 0.9
3D-FFT 398671.243138 3189369.945 5.5
Solve PME 100.010000 6400.640 0.0
Shift-X 34.192224 205.153 0.0
Angles 325.472544 54679.387 0.1
Propers 548.454840 125596.158 0.2
Impropers 40.564056 8437.324 0.0
Virial 13.763169 247.737 0.0
Stop-CM 13.894848 138.948 0.0
Calc-Ekin 272.720448 7363.452 0.0
Lincs 131.439420 7886.365 0.0
Lincs-Mat 3692.627456 14770.510 0.0
Constraint-V 1391.078160 11128.625 0.0
Constraint-Vir 12.719940 305.279 0.0
Settle 376.112800 121484.434 0.2
Virtual Site 3 21.012160 777.450 0.0
Virtual Site 3fd 19.880736 1888.670 0.0
Virtual Site 3fad 3.313456 583.168 0.0
Virtual Site 3out 57.217728 4977.942 0.0
Virtual Site 4fdn 16.486464 4187.562 0.0
-----------------------------------------------------------------------------
Total 57716385.269 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 8 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Vsite constr. 1 8 10001 1.000 28.733 0.1
Neighbor search 1 8 251 5.410 155.445 0.4
Launch GPU ops. 1 8 10001 1353.401 38888.866 89.6
Force 1 8 10001 6.941 199.443 0.5
PME mesh 1 8 10001 121.314 3485.853 8.0
Wait GPU NB local 1 8 10001 0.425 12.220 0.0
NB X/F buffer ops. 1 8 19751 6.577 188.988 0.4
Vsite spread 1 8 10102 1.572 45.169 0.1
Write traj. 1 8 2 0.505 14.506 0.0
Update 1 8 10001 5.537 159.111 0.4
Constraints 1 8 10003 6.719 193.054 0.4
Rest 0.691 19.863 0.0
-----------------------------------------------------------------------------
Total 1510.092 43391.251 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME spread 1 8 10001 40.884 1174.769 2.7
PME gather 1 8 10001 29.190 838.744 1.9
PME 3D-FFT 1 8 20002 48.085 1381.693 3.2
PME solve Elec 1 8 10001 3.060 87.920 0.2
-----------------------------------------------------------------------------
GPU timings
-----------------------------------------------------------------------------
Computing: Count Wall t (s) ms/step %
-----------------------------------------------------------------------------
Pair list H2D 251 0.377 1.500 0.0
X / q H2D 10001 18.341 1.834 0.0
Nonbonded F kernel 9900 55340232448.731 5589922469 50.0
Nonbonded F+ene k. 101 62.366 617.484 0.0
Pruning kernel 251 6.736 26.836 0.0
F D2H 10001 55340232519.377 5533469904 50.0
-----------------------------------------------------------------------------
Total 110680465055.927 11066939811.612 100.0
-----------------------------------------------------------------------------
*Dynamic pruning 4750 25.872 5.447 0.0
-----------------------------------------------------------------------------
Average per-step force GPU/CPU evaluation time ratio: 11066939811.612 ms/12.824 ms = 862973673.837
For optimal resource utilization this ratio should be close to 1
NOTE: The GPU has >25% more load than the CPU. This imbalance wastes
CPU resources.
Core t (s) Wall t (s) (%)
Time: 12080.732 1510.092 800.0
(ns/day) (hour/ns)
Performance: 2.861 8.389
[ LOG FILE ENDS ]
Cheers from an unbearably hot São Paulo,
--
Elton Carvalho