[gmx-users] Nonsense timing accounting with OpenCL in Haswell

Elton Carvalho eltonfc at gmail.com
Fri Jan 25 22:12:26 CET 2019


Hi, Roland,

Thank you for your response. If the GT2 GPU won't give much acceleration,
I'm not sure it's worth fixing the timing issue in beignet, since the driver is
deprecated. There's not much sense in making a deprecated driver work just so an
ineffective GPU can be used.

I'll be testing a GTX 1050 Ti in an identical machine soon. If the benchmark
goes well, I'll buy the same card for the workstation. (I'm limited to the
1050 Ti by the workstation's small form factor.)
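For anyone following along: Roland's suggestion of comparing runs with and
without the GPU boils down to something like the commands below; -nb is the
standard mdrun option selecting where the short-range nonbondeds run, and the
log file names are just examples.

```shell
# Same .tpr, same machine: force short-range nonbondeds onto CPU vs. GPU.
gmx mdrun -nb cpu -maxh .5 -notunepme -g nb_cpu.log
gmx mdrun -nb gpu -maxh .5 -notunepme -g nb_gpu.log

# Compare the final performance lines (ns/day) of the two runs.
grep -A1 "Performance:" nb_cpu.log nb_gpu.log
```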

Thanks again, from a windy and rainy São Paulo,
Elton

On Thu, Jan 24, 2019 at 7:27 AM Schulz, Roland <roland.schulz at intel.com>
wrote:

> Hi Elton,
>
> It is very unlikely that you would be able to get any speedup with OpenCL
> with that CPU. For its generation the CPU is powerful but the GPU is a
> lower tier. To get meaningful speedup compared to running on i7 CPU you
> want to have a GT3 or higher GPU with at least 48EUs. Yours has 20EUs.
> The nonsensical timings might be caused by beignet; we haven't tested
> beignet with GROMACS.
> Other than for the GPU/CPU load balancing (which you could do manually)
> the timing shouldn’t affect the performance. So you can try runs with and
> without GPUs to determine what the speedup/slowdown is of using the GPU.
> If you are interested in finding out what exactly goes wrong with the timing,
> you might want to look at getLastRangeTime in
> src/gromacs/gpu_utils/gpuregiontimer_ocl.h. Maybe add some extra debug
> printing to see whether you notice a few abnormal timings or whether all
> are nonsensical.
>
> Roland
>
>
> From: Elton Carvalho <eltonfc at gmail.com>
> Date: Thu, 24 Jan 2019 at 04:06
> Subject: [gmx-users] Nonsense timing accounting with OpenCL in Haswell
> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>
>
> Greetings!
>
> I'm trying to set up gromacs-2019 to use OpenCL with my Intel GPU,
> integrated into a Haswell processor. The log file says it's detected as
>
> Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel, device version:
> OpenCL 1.2 beignet 1.3, stat: compatible
>
> I'm running beignet as the OpenCL driver because the NEO drivers don't seem
> to support Haswell.
>
> Testing with the "NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water" benchmark,
> as available in the "adh_cubic_vsites" directory, I get *MUCH* slower
> performance and some really weird timings at the end of the logfile. Things
> like:
>
> 1)  "Launch GPU ops." taking almost 90% of the run time
> 2) "Nonbonded F kernel" in the GPU times with nonsense readings such as "
> 5589922469 ms/step" in a 30-minute test run.
> 3) 110680465055.927 seconds of Total walltime in the GPU in a 30-minute
> real-time run.
>
> My questions are:
> a) Could this nonsense timing be coming from beignet, which is not _really_
> supported? If not, where could it be coming from?
> b) How can I troubleshoot this and get sensible timings to decide whether
> using OpenCL in this machine is even worth it?
> c) Is Intel OpenCL even worth it in a Haswell machine (i7-4790)? :)
>
> The whole log is available at
> https://gist.github.com/eltonfc/dd8755bce756b627464df70faa9d3bab and the
> relevant parts are below:
>
> [ LOG FILE BEGINS ]
> Command line:
>   gmx mdrun -v -maxh .5 -notunepme
>
> GROMACS version:    2019
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support:        OpenCL
> SIMD instructions:  AVX2_256
> FFT library:        fftw-3.3.6-pl2-fma-sse2-avx-avx2-avx2_128
> RDTSCP usage:       enabled
> TNG support:        enabled
> Hwloc support:      hwloc-1.11.2
> Tracing support:    disabled
> C compiler:         /usr/bin/cc GNU 7.3.0
> C compiler flags:    -mavx2 -mfma     -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast
> C++ compiler:       /usr/bin/c++ GNU 7.3.0
> C++ compiler flags:  -mavx2 -mfma    -std=c++11   -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> OpenCL include dir: /usr/include
> OpenCL library:     /usr/lib/libOpenCL.so
> OpenCL version:     2.0
>
>
> Running on 1 node with total 4 cores, 8 logical cores, 1 compatible GPU
> Hardware detected:
>   CPU info:
>     Vendor: Intel
>     Brand:  Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
>     Family: 6   Model: 60   Stepping: 3
>     Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel
> lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
> rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
>   Hardware topology: Full, with devices
>     Sockets, cores, and logical processors:
>       Socket  0: [   0   4] [   1   5] [   2   6] [   3   7]
>     Numa nodes:
>       Node  0 (16704245760 bytes mem):   0   1   2   3   4   5   6   7
>       Latency:
>                0
>          0  1.00
>     Caches:
>       L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
>       L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
>       L3: 8388608 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
>     PCI devices:
>       0000:00:02.0  Id: 8086:0412  Class: 0x0300  Numa: 0
>       0000:00:19.0  Id: 8086:153a  Class: 0x0200  Numa: 0
>       0000:00:1f.2  Id: 8086:8c02  Class: 0x0106  Numa: 0
>   GPU info:
>     Number of GPUs detected: 1
>     #0: name: Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel,
> device version: OpenCL 1.2 beignet 1.3, stat: compatible
>
> [... skipping ... ]
>
> Changing rlist from 0.935 to 0.956 for non-bonded 4x2 atom kernels
>
> Changing nstlist from 10 to 40, rlist from 0.956 to 1.094
>
> Using 1 MPI thread
> Using 8 OpenMP threads
>
> 1 GPU auto-selected for this run.
> Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
>   PP:0
> PP tasks will do (non-perturbed) short-ranged interactions on the GPU
> Pinning threads with an auto-selected logical core stride of 1
> System total charge: 0.000
> Will do PME sum in reciprocal space for electrostatic interactions.
>
>  [... skipping ...]
> M E G A - F L O P S   A C C O U N T I N G
>
>  NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
>  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>  V&F=Potential and force  V=Potential only  F=Force only
>
>  Computing:                               M-Number         M-Flops  % Flops
>
> -----------------------------------------------------------------------------
>  Pair Search distance check            4723.039760       42507.358     0.1
>  NxN Ewald Elec. + LJ [F]            793216.982880    52352320.870    90.7
>  NxN Ewald Elec. + LJ [V&F]            8092.108144      865855.571     1.5
>  1,4 nonbonded interactions             562.216216       50599.459     0.1
>  Calc Weights                          4087.128672      147136.632     0.3
>  Spread Q Bspline                     87192.078336      174384.157     0.3
>  Gather F Bspline                     87192.078336      523152.470     0.9
>  3D-FFT                              398671.243138     3189369.945     5.5
>  Solve PME                              100.010000        6400.640     0.0
>  Shift-X                                 34.192224         205.153     0.0
>  Angles                                 325.472544       54679.387     0.1
>  Propers                                548.454840      125596.158     0.2
>  Impropers                               40.564056        8437.324     0.0
>  Virial                                  13.763169         247.737     0.0
>  Stop-CM                                 13.894848         138.948     0.0
>  Calc-Ekin                              272.720448        7363.452     0.0
>  Lincs                                  131.439420        7886.365     0.0
>  Lincs-Mat                             3692.627456       14770.510     0.0
>  Constraint-V                          1391.078160       11128.625     0.0
>  Constraint-Vir                          12.719940         305.279     0.0
>  Settle                                 376.112800      121484.434     0.2
>  Virtual Site 3                          21.012160         777.450     0.0
>  Virtual Site 3fd                        19.880736        1888.670     0.0
>  Virtual Site 3fad                        3.313456         583.168     0.0
>  Virtual Site 3out                       57.217728        4977.942     0.0
>  Virtual Site 4fdn                       16.486464        4187.562     0.0
>
> -----------------------------------------------------------------------------
>  Total                                                57716385.269   100.0
>
> -----------------------------------------------------------------------------
>
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 1 MPI rank, each using 8 OpenMP threads
>
>  Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                      Ranks Threads  Count      (s)         total sum    %
>
> -----------------------------------------------------------------------------
>  Vsite constr.          1    8      10001       1.000         28.733   0.1
>  Neighbor search        1    8        251       5.410        155.445   0.4
>  Launch GPU ops.        1    8      10001    1353.401      38888.866  89.6
>  Force                  1    8      10001       6.941        199.443   0.5
>  PME mesh               1    8      10001     121.314       3485.853   8.0
>  Wait GPU NB local      1    8      10001       0.425         12.220   0.0
>  NB X/F buffer ops.     1    8      19751       6.577        188.988   0.4
>  Vsite spread           1    8      10102       1.572         45.169   0.1
>  Write traj.            1    8          2       0.505         14.506   0.0
>  Update                 1    8      10001       5.537        159.111   0.4
>  Constraints            1    8      10003       6.719        193.054   0.4
>  Rest                                           0.691         19.863   0.0
>
> -----------------------------------------------------------------------------
>  Total                                       1510.092      43391.251 100.0
>
> -----------------------------------------------------------------------------
>  Breakdown of PME mesh computation
>
> -----------------------------------------------------------------------------
>  PME spread             1    8      10001      40.884       1174.769   2.7
>  PME gather             1    8      10001      29.190        838.744   1.9
>  PME 3D-FFT             1    8      20002      48.085       1381.693   3.2
>  PME solve Elec         1    8      10001       3.060         87.920   0.2
>
> -----------------------------------------------------------------------------
>
>  GPU timings
>
> -----------------------------------------------------------------------------
>  Computing:                         Count  Wall t (s)      ms/step       %
>
> -----------------------------------------------------------------------------
>  Pair list H2D                        251       0.377        1.500     0.0
>  X / q H2D                          10001      18.341        1.834     0.0
>  Nonbonded F kernel                  9900 55340232448.731   5589922469  50.0
>  Nonbonded F+ene k.                   101      62.366      617.484     0.0
>  Pruning kernel                       251       6.736       26.836     0.0
>  F D2H                              10001 55340232519.377   5533469904  50.0
>
> -----------------------------------------------------------------------------
>  Total                                    110680465055.927   1106693981 100.0
>
> -----------------------------------------------------------------------------
>  *Dynamic pruning                    4750      25.872        5.447     0.0
>
> -----------------------------------------------------------------------------
>
> Average per-step force GPU/CPU evaluation time ratio:
> 11066939811.612 ms / 12.824 ms = 862973673.837
> For optimal resource utilization this ratio should be close to 1
>
> NOTE: The GPU has >25% more load than the CPU. This imbalance wastes
>       CPU resources.
>
>                Core t (s)   Wall t (s)        (%)
>        Time:    12080.732     1510.092      800.0
>                  (ns/day)    (hour/ns)
> Performance:        2.861        8.389
>
> [ LOG FILE ENDS ]
>
> Cheers from an unbearably hot São Paulo,
> --
> Elton Carvalho
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.



-- 
Elton Carvalho

