[gmx-users] Nonsense timing accounting with OpenCL in Haswell
Elton Carvalho
eltonfc at gmail.com
Thu Jan 24 04:05:45 CET 2019
Greetings!
I'm trying to set up gromacs-2019 to use OpenCL with my Intel GPU,
integrated in a Haswell processor. The log file says it's detected as
Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel, device version:
OpenCL 1.2 beignet 1.3, stat: compatible
I'm running beignet as the OpenCL driver because the NEO drivers don't seem
to support Haswell.
Testing with the "NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water" benchmark,
available in the "adh_cubic_vsites" directory, I get *MUCH* slower
performance and some really weird timings at the end of the log file. Things
like:
1) "Launch GPU ops." taking almost 90% of the run time
2) "Nonbonded F kernel" in the GPU times with nonsense readings such as "
5589922469 ms/step" in a 30-minute test run.
3) 110680465055.927 seconds of total walltime on the GPU in a 30-minute
real-time run (see the quick scale check right below).
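To put items 2) and 3) in perspective: taking the reported ms/step at face
value, and assuming the "Nonbonded F kernel" call count of 9900 (the 10001
steps minus the 101 force+energy calls), the implied kernel time works out
to roughly 1750 years, against an actual wall time of about 25 minutes. A
trivial check of the scale:

#include <stdio.h>

int main(void)
{
    double ms_per_step = 5589922469.0; /* "Nonbonded F kernel" ms/step from the log */
    double steps       = 9900;         /* F kernel calls: 10001 - 101 F+ene calls */
    double wall_s      = 1510.092;     /* actual total wall time reported by mdrun */

    double implied_s = ms_per_step / 1000.0 * steps;
    printf("implied kernel time: %.3e s (~%.0f years)\n",
           implied_s, implied_s / (365.25 * 24 * 3600));
    printf("actual wall time:    %.1f s\n", wall_s);
    return 0;
}

That is a discrepancy of more than seven orders of magnitude, which looks
much more like broken or overflowing profiling counters than like a real
performance problem.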
My questions are:
a) Could this nonsense timing be coming from beignet, which is not _really_
supported? If not, where else could it come from?
b) How can I troubleshoot this and get sensible timings, so I can decide
whether using OpenCL on this machine is even worth it? (A standalone probe
sketch follows after these questions.)
c) Is Intel OpenCL even worth it on a Haswell machine (i7-4790)? :)
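Regarding b): as far as I understand, mdrun's OpenCL GPU timings come from
event profiling (clGetEventProfilingInfo), so a standalone probe should show
whether beignet's profiling timestamps are sane at all, independently of
GROMACS. A minimal sketch, with error handling mostly omitted (the
deprecated clCreateCommandQueue is still fine on an OpenCL 1.2 device, and
the library path may vary):

/* profile_check.c: OpenCL event-profiling sanity check.
 * Build: cc profile_check.c -lOpenCL -o profile_check
 */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id   dev;
    cl_int         err;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    /* Profiling must be enabled on the queue, as mdrun does. */
    cl_command_queue q =
        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);

    /* Time a trivial 1 MiB host->device transfer. */
    size_t n = 1 << 20;
    char *host = calloc(1, n);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n, NULL, &err);

    cl_event ev;
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n, host, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof t0, &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof t1, &t1, NULL);

    /* Timestamps are nanoseconds; a 1 MiB copy should take on the
     * order of microseconds to milliseconds, nothing astronomical. */
    printf("start=%llu ns  end=%llu ns  delta=%.3f ms\n",
           (unsigned long long)t0, (unsigned long long)t1,
           (t1 - t0) / 1e6);

    clReleaseEvent(ev);
    clReleaseMemObject(buf);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    free(host);
    return 0;
}

If this prints garbage deltas, the problem is in beignet's profiling
counters, not in GROMACS' bookkeeping. And if I'm reading the manual right,
rerunning with GMX_DISABLE_GPU_TIMING set should then at least show whether
the huge "Launch GPU ops." share is real overhead or just broken timing.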
The whole log is available at
https://gist.github.com/eltonfc/dd8755bce756b627464df70faa9d3bab and the
relevant parts are below:
[ LOG FILE BEGINS ]
Command line:
gmx mdrun -v -maxh .5 -notunepme
GROMACS version: 2019
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: OpenCL
SIMD instructions: AVX2_256
FFT library: fftw-3.3.6-pl2-fma-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.11.2
Tracing support: disabled
C compiler: /usr/bin/cc GNU 7.3.0
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
OpenCL include dir: /usr/include
OpenCL library: /usr/lib/libOpenCL.so
OpenCL version: 2.0
Running on 1 node with total 4 cores, 8 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Family: 6 Model: 60 Stepping: 3
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel
lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Full, with devices
Sockets, cores, and logical processors:
Socket 0: [ 0 4] [ 1 5] [ 2 6] [ 3 7]
Numa nodes:
Node 0 (16704245760 bytes mem): 0 1 2 3 4 5 6 7
Latency:
0
0 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
L3: 8388608 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
PCI devices:
0000:00:02.0 Id: 8086:0412 Class: 0x0300 Numa: 0
0000:00:19.0 Id: 8086:153a Class: 0x0200 Numa: 0
0000:00:1f.2 Id: 8086:8c02 Class: 0x0106 Numa: 0
GPU info:
Number of GPUs detected: 1
#0: name: Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel,
device version: OpenCL 1.2 beignet 1.3, stat: compatible
[... skipping ...]
Changing rlist from 0.935 to 0.956 for non-bonded 4x2 atom kernels
Changing nstlist from 10 to 40, rlist from 0.956 to 1.094
Using 1 MPI thread
Using 8 OpenMP threads
1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
PP:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.
[... skipping ...]
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 4723.039760 42507.358 0.1
NxN Ewald Elec. + LJ [F] 793216.982880 52352320.870 90.7
NxN Ewald Elec. + LJ [V&F] 8092.108144 865855.571 1.5
1,4 nonbonded interactions 562.216216 50599.459 0.1
Calc Weights 4087.128672 147136.632 0.3
Spread Q Bspline 87192.078336 174384.157 0.3
Gather F Bspline 87192.078336 523152.470 0.9
3D-FFT 398671.243138 3189369.945 5.5
Solve PME 100.010000 6400.640 0.0
Shift-X 34.192224 205.153 0.0
Angles 325.472544 54679.387 0.1
Propers 548.454840 125596.158 0.2
Impropers 40.564056 8437.324 0.0
Virial 13.763169 247.737 0.0
Stop-CM 13.894848 138.948 0.0
Calc-Ekin 272.720448 7363.452 0.0
Lincs 131.439420 7886.365 0.0
Lincs-Mat 3692.627456 14770.510 0.0
Constraint-V 1391.078160 11128.625 0.0
Constraint-Vir 12.719940 305.279 0.0
Settle 376.112800 121484.434 0.2
Virtual Site 3 21.012160 777.450 0.0
Virtual Site 3fd 19.880736 1888.670 0.0
Virtual Site 3fad 3.313456 583.168 0.0
Virtual Site 3out 57.217728 4977.942 0.0
Virtual Site 4fdn 16.486464 4187.562 0.0
-----------------------------------------------------------------------------
Total 57716385.269 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 8 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Vsite constr. 1 8 10001 1.000 28.733 0.1
Neighbor search 1 8 251 5.410 155.445 0.4
Launch GPU ops. 1 8 10001 1353.401 38888.866 89.6
Force 1 8 10001 6.941 199.443 0.5
PME mesh 1 8 10001 121.314 3485.853 8.0
Wait GPU NB local 1 8 10001 0.425 12.220 0.0
NB X/F buffer ops. 1 8 19751 6.577 188.988 0.4
Vsite spread 1 8 10102 1.572 45.169 0.1
Write traj. 1 8 2 0.505 14.506 0.0
Update 1 8 10001 5.537 159.111 0.4
Constraints 1 8 10003 6.719 193.054 0.4
Rest 0.691 19.863 0.0
-----------------------------------------------------------------------------
Total 1510.092 43391.251 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME spread 1 8 10001 40.884 1174.769 2.7
PME gather 1 8 10001 29.190 838.744 1.9
PME 3D-FFT 1 8 20002 48.085 1381.693 3.2
PME solve Elec 1 8 10001 3.060 87.920 0.2
-----------------------------------------------------------------------------
GPU timings
-----------------------------------------------------------------------------
Computing: Count Wall t (s) ms/step %
-----------------------------------------------------------------------------
Pair list H2D 251 0.377 1.500 0.0
X / q H2D 10001 18.341 1.834 0.0
Nonbonded F kernel 9900 55340232448.731 5589922469 50.0
Nonbonded F+ene k. 101 62.366 617.484 0.0
Pruning kernel 251 6.736 26.836 0.0
F D2H 10001 55340232519.377 5533469904 50.0
-----------------------------------------------------------------------------
Total 110680465055.927 11066939811.612 100.0
-----------------------------------------------------------------------------
*Dynamic pruning 4750 25.872 5.447 0.0
-----------------------------------------------------------------------------
Average per-step force GPU/CPU evaluation time ratio: 11066939811.612 ms/12.824 ms = 862973673.837
For optimal resource utilization this ratio should be close to 1
NOTE: The GPU has >25% more load than the CPU. This imbalance wastes
CPU resources.
Core t (s) Wall t (s) (%)
Time: 12080.732 1510.092 800.0
(ns/day) (hour/ns)
Performance: 2.861 8.389
[ LOG FILE ENDS ]
Cheers from an unbearably hot São Paulo,
--
Elton Carvalho