[gmx-users] Nonsense timing accounting with OpenCL in Haswell
Schulz, Roland
roland.schulz at intel.com
Thu Jan 24 10:27:17 CET 2019
Hi Elton,
It is very unlikely that you would get any speedup with OpenCL on that CPU. For its generation the CPU is powerful, but the GPU is a lower tier. To get a meaningful speedup compared to running on the i7 CPU alone, you want a GT3 or higher GPU with at least 48 EUs; yours has 20 EUs.
The nonsensical timings might be caused by beignet; we haven't tested beignet with GROMACS.
Other than for GPU/CPU load balancing (which you could do manually), the timings shouldn't affect performance. So you can try runs with and without the GPU to determine the speedup or slowdown from using it.
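In case it helps, a sketch of such a comparison (the `-nb` flag forces where the nonbonded kernels run; `topol.tpr` and the `-deffnm` names are placeholders for your own files):

```shell
# Force the nonbonded kernels onto the CPU only:
gmx mdrun -nb cpu -s topol.tpr -deffnm cpu_run -maxh .5 -notunepme
# Force the nonbonded kernels onto the GPU:
gmx mdrun -nb gpu -s topol.tpr -deffnm gpu_run -maxh .5 -notunepme
# Compare the "Performance:" (ns/day) lines at the end of each log:
grep -A1 'Performance:' cpu_run.log gpu_run.log
```

The ns/day numbers come from the wall clock, not the (possibly broken) GPU event timers, so they should be trustworthy even with beignet.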
If you are interested in finding out what exactly goes wrong with the timing, you might want to look at getLastRangeTime in src/gromacs/gpu_utils/gpuregiontimer_ocl.h. Maybe add some extra debug printing to see whether only a few timings are abnormal or whether all of them are nonsensical.
Roland
From: Elton Carvalho <eltonfc at gmail.com>
Date: Thu, 24 Jan 2019 at 04:06
Subject: [gmx-users] Nonsense timing accounting with OpenCL in Haswell
To: Discussion list for GROMACS users <gmx-users at gromacs.org>
Greetings!
I'm trying to set up gromacs-2019 to use OpenCL with my Intel GPU,
integrated in a Haswell processor. The log file says it's detected as
Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel, device version:
OpenCL 1.2 beignet 1.3, stat: compatible
I'm running beignet as the OpenCL driver because the NEO drivers don't seem
to support Haswell.
Testing with the "NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water" benchmark,
as available in the "adh_cubic_vsites" directory, I get *MUCH* slower
performance and some really weird timings at the end of the logfile. Things
like:
1) "Launch GPU ops." taking almost 90% of the run time
2) "Nonbonded F kernel" in the GPU times with nonsense readings such as "
5589922469 ms/step" in a 30-minute test run.
3) 110680465055.927 seconds of Total walltime in the GPU in a 30-minute
real-time run.
My questions are:
a) Could this nonsense timing be coming from beignet, which is not _really_
supported? If not, where else could it come from?
b) How can I troubleshoot this and get sensible timings to decide whether
using OpenCL on this machine is even worth it?
c) Is Intel OpenCL even worth it on a Haswell machine (i7-4790)? :)
The whole log is available at
https://gist.github.com/eltonfc/dd8755bce756b627464df70faa9d3bab and the
relevant parts are below:
[ LOG FILE BEGINS ]
Command line:
gmx mdrun -v -maxh .5 -notunepme
GROMACS version: 2019
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: OpenCL
SIMD instructions: AVX2_256
FFT library: fftw-3.3.6-pl2-fma-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.11.2
Tracing support: disabled
C compiler: /usr/bin/cc GNU 7.3.0
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
OpenCL include dir: /usr/include
OpenCL library: /usr/lib/libOpenCL.so
OpenCL version: 2.0
Running on 1 node with total 4 cores, 8 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Family: 6 Model: 60 Stepping: 3
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel
lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Full, with devices
Sockets, cores, and logical processors:
Socket 0: [ 0 4] [ 1 5] [ 2 6] [ 3 7]
Numa nodes:
Node 0 (16704245760 bytes mem): 0 1 2 3 4 5 6 7
Latency:
0
0 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
L3: 8388608 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
PCI devices:
0000:00:02.0 Id: 8086:0412 Class: 0x0300 Numa: 0
0000:00:19.0 Id: 8086:153a Class: 0x0200 Numa: 0
0000:00:1f.2 Id: 8086:8c02 Class: 0x0106 Numa: 0
GPU info:
Number of GPUs detected: 1
#0: name: Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel,
device version: OpenCL 1.2 beignet 1.3, stat: compatible
[... skipping ... ]
Changing rlist from 0.935 to 0.956 for non-bonded 4x2 atom kernels
Changing nstlist from 10 to 40, rlist from 0.956 to 1.094
Using 1 MPI thread
Using 8 OpenMP threads
1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
PP:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.
[... skipping ...]
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 4723.039760 42507.358 0.1
NxN Ewald Elec. + LJ [F] 793216.982880 52352320.870 90.7
NxN Ewald Elec. + LJ [V&F] 8092.108144 865855.571 1.5
1,4 nonbonded interactions 562.216216 50599.459 0.1
Calc Weights 4087.128672 147136.632 0.3
Spread Q Bspline 87192.078336 174384.157 0.3
Gather F Bspline 87192.078336 523152.470 0.9
3D-FFT 398671.243138 3189369.945 5.5
Solve PME 100.010000 6400.640 0.0
Shift-X 34.192224 205.153 0.0
Angles 325.472544 54679.387 0.1
Propers 548.454840 125596.158 0.2
Impropers 40.564056 8437.324 0.0
Virial 13.763169 247.737 0.0
Stop-CM 13.894848 138.948 0.0
Calc-Ekin 272.720448 7363.452 0.0
Lincs 131.439420 7886.365 0.0
Lincs-Mat 3692.627456 14770.510 0.0
Constraint-V 1391.078160 11128.625 0.0
Constraint-Vir 12.719940 305.279 0.0
Settle 376.112800 121484.434 0.2
Virtual Site 3 21.012160 777.450 0.0
Virtual Site 3fd 19.880736 1888.670 0.0
Virtual Site 3fad 3.313456 583.168 0.0
Virtual Site 3out 57.217728 4977.942 0.0
Virtual Site 4fdn 16.486464 4187.562 0.0
-----------------------------------------------------------------------------
Total 57716385.269 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 8 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Vsite constr. 1 8 10001 1.000 28.733 0.1
Neighbor search 1 8 251 5.410 155.445 0.4
Launch GPU ops. 1 8 10001 1353.401 38888.866 89.6
Force 1 8 10001 6.941 199.443 0.5
PME mesh 1 8 10001 121.314 3485.853 8.0
Wait GPU NB local 1 8 10001 0.425 12.220 0.0
NB X/F buffer ops. 1 8 19751 6.577 188.988 0.4
Vsite spread 1 8 10102 1.572 45.169 0.1
Write traj. 1 8 2 0.505 14.506 0.0
Update 1 8 10001 5.537 159.111 0.4
Constraints 1 8 10003 6.719 193.054 0.4
Rest 0.691 19.863 0.0
-----------------------------------------------------------------------------
Total 1510.092 43391.251 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME spread 1 8 10001 40.884 1174.769 2.7
PME gather 1 8 10001 29.190 838.744 1.9
PME 3D-FFT 1 8 20002 48.085 1381.693 3.2
PME solve Elec 1 8 10001 3.060 87.920 0.2
-----------------------------------------------------------------------------
GPU timings
-----------------------------------------------------------------------------
Computing: Count Wall t (s) ms/step %
-----------------------------------------------------------------------------
Pair list H2D 251 0.377 1.500 0.0
X / q H2D 10001 18.341 1.834 0.0
Pruning kernel            9900 55340232448.731    5589922469      50.0
Nonbonded F+ene k. 101 62.366 617.484 0.0
Pruning kernel 251 6.736 26.836 0.0
F D2H                    10001 55340232519.377    5533469904      50.0
-----------------------------------------------------------------------------
Total                          110680465055.927   1106693981     100.0
-----------------------------------------------------------------------------
*Dynamic pruning 4750 25.872 5.447 0.0
-----------------------------------------------------------------------------
Average per-step force GPU/CPU evaluation time ratio: 11066939811.612 ms / 12.824 ms = 862973673.837
For optimal resource utilization this ratio should be close to 1
NOTE: The GPU has >25% more load than the CPU. This imbalance wastes
CPU resources.
Core t (s) Wall t (s) (%)
Time: 12080.732 1510.092 800.0
(ns/day) (hour/ns)
Performance: 2.861 8.389
[ LOG FILE ENDS ]
Cheers from an unbearably hot São Paulo,
--
Elton Carvalho
--
Gromacs Users mailing list
* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.
More information about the gromacs.org_gmx-users mailing list