[gmx-users] Nonsense timing accounting with OpenCL in Haswell
Elton Carvalho
eltonfc at gmail.com
Fri Jan 25 22:12:26 CET 2019
Hi, Roland,
Thank you for your response. If the GT2 GPU won't give much acceleration,
I'm not sure it's worth fixing the timing issue in beignet, since beignet is
deprecated. There's not much sense in making a deprecated driver work just
so an ineffective GPU can be used.
I'll be testing a GTX 1050 Ti in an identical machine soon; if the
benchmark goes well, I'll buy the same card for the workstation. (I'm
limited to the 1050 Ti because of the workstation's small form factor.)
Thanks again, from a windy and rainy São Paulo,
Elton
On Thu, Jan 24, 2019 at 7:27 AM Schulz, Roland <roland.schulz at intel.com>
wrote:
> Hi Elton,
>
> It is very unlikely that you would be able to get any speedup with OpenCL
> with that CPU. For its generation the CPU is powerful but the GPU is a
> lower tier. To get meaningful speedup compared to running on i7 CPU you
> want to have a GT3 or higher GPU with at least 48EUs. Yours has 20EUs.
> The nonsense timing might be caused by beignet; we haven’t tested
> beignet with GROMACS.
> Other than for the GPU/CPU load balancing (which you could do manually)
> the timing shouldn’t affect the performance. So you can try runs with and
> without GPUs to determine what the speedup/slowdown is of using the GPU.
> If you are interested in finding out what exactly goes wrong with the timing,
> you might want to look at getLastRangeTime in
> src/gromacs/gpu_utils/gpuregiontimer_ocl.h. Maybe add some extra debug
> printing to see whether you notice a few abnormal timings or whether all
> are nonsensical.
>
> Roland
>
>
> From: Elton Carvalho <eltonfc at gmail.com>
> Date: Thu, 24 Jan 2019 at 04:06
> Subject: [gmx-users] Nonsense timing accounting with OpenCL in Haswell
> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>
>
> Greetings!
>
> I'm trying to set up gromacs-2019 to use OpenCL with my Intel GPU,
> integrated in a Haswell processor. The log file says it's detected as
>
> Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel, device version:
> OpenCL 1.2 beignet 1.3, stat: compatible
>
> I'm running beignet as the OpenCL driver because the NEO drivers don't seem
> to support Haswell.
>
> Testing with the "NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water" benchmark,
> as available in the "adh_cubic_vsites" directory, I get *MUCH* slower
> performance and some really weird timings at the end of the logfile. Things
> like:
>
> 1) "Launch GPU ops." taking almost 90% of the run time
> 2) "Nonbonded F kernel" in the GPU times with nonsense readings such as "
> 5589922469 ms/step" in a 30-minute test run.
> 3) 110680465055.927 seconds of Total walltime in the GPU in a 30-minute
> real-time run.
>
> My questions are:
> a) Could this nonsense timing be coming from beignet, which is not _really_
> supported? If not, where could it be coming from?
> b) How can I troubleshoot this and get sensible timings to decide whether
> using OpenCL in this machine is even worth it?
> c) Is Intel OpenCL even worth it on a Haswell machine (i7-4790)? :)
>
> The whole log is available at
> https://gist.github.com/eltonfc/dd8755bce756b627464df70faa9d3bab and the
> relevant parts are below:
>
> [ LOG FILE BEGINS ]
> Command line:
> gmx mdrun -v -maxh .5 -notunepme
>
> GROMACS version: 2019
> Precision: single
> Memory model: 64 bit
> MPI library: thread_mpi
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support: OpenCL
> SIMD instructions: AVX2_256
> FFT library: fftw-3.3.6-pl2-fma-sse2-avx-avx2-avx2_128
> RDTSCP usage: enabled
> TNG support: enabled
> Hwloc support: hwloc-1.11.2
> Tracing support: disabled
> C compiler: /usr/bin/cc GNU 7.3.0
> C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast
> C++ compiler: /usr/bin/c++ GNU 7.3.0
> C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> OpenCL include dir: /usr/include
> OpenCL library: /usr/lib/libOpenCL.so
> OpenCL version: 2.0
>
>
> Running on 1 node with total 4 cores, 8 logical cores, 1 compatible GPU
> Hardware detected:
> CPU info:
> Vendor: Intel
> Brand: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
> Family: 6 Model: 60 Stepping: 3
> Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel
> lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
> rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> Hardware topology: Full, with devices
> Sockets, cores, and logical processors:
> Socket 0: [ 0 4] [ 1 5] [ 2 6] [ 3 7]
> Numa nodes:
> Node 0 (16704245760 bytes mem): 0 1 2 3 4 5 6 7
> Latency:
> 0
> 0 1.00
> Caches:
> L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
> L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 2 ways
> L3: 8388608 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
> PCI devices:
> 0000:00:02.0 Id: 8086:0412 Class: 0x0300 Numa: 0
> 0000:00:19.0 Id: 8086:153a Class: 0x0200 Numa: 0
> 0000:00:1f.2 Id: 8086:8c02 Class: 0x0106 Numa: 0
> GPU info:
> Number of GPUs detected: 1
> #0: name: Intel(R) HD Graphics Haswell GT2 Desktop, vendor: Intel,
> device version: OpenCL 1.2 beignet 1.3, stat: compatible
>
> [... skipping ... ]
>
> Changing rlist from 0.935 to 0.956 for non-bonded 4x2 atom kernels
>
> Changing nstlist from 10 to 40, rlist from 0.956 to 1.094
>
> Using 1 MPI thread
> Using 8 OpenMP threads
>
> 1 GPU auto-selected for this run.
> Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
> PP:0
> PP tasks will do (non-perturbed) short-ranged interactions on the GPU
> Pinning threads with an auto-selected logical core stride of 1
> System total charge: 0.000
> Will do PME sum in reciprocal space for electrostatic interactions.
>
> [... skipping ...]
> M E G A - F L O P S A C C O U N T I N G
>
> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
> W3=SPC/TIP3p W4=TIP4p (single or pairs)
> V&F=Potential and force V=Potential only F=Force only
>
> Computing: M-Number M-Flops % Flops
>
> -----------------------------------------------------------------------------
> Pair Search distance check 4723.039760 42507.358 0.1
> NxN Ewald Elec. + LJ [F] 793216.982880 52352320.870 90.7
> NxN Ewald Elec. + LJ [V&F] 8092.108144 865855.571 1.5
> 1,4 nonbonded interactions 562.216216 50599.459 0.1
> Calc Weights 4087.128672 147136.632 0.3
> Spread Q Bspline 87192.078336 174384.157 0.3
> Gather F Bspline 87192.078336 523152.470 0.9
> 3D-FFT 398671.243138 3189369.945 5.5
> Solve PME 100.010000 6400.640 0.0
> Shift-X 34.192224 205.153 0.0
> Angles 325.472544 54679.387 0.1
> Propers 548.454840 125596.158 0.2
> Impropers 40.564056 8437.324 0.0
> Virial 13.763169 247.737 0.0
> Stop-CM 13.894848 138.948 0.0
> Calc-Ekin 272.720448 7363.452 0.0
> Lincs 131.439420 7886.365 0.0
> Lincs-Mat 3692.627456 14770.510 0.0
> Constraint-V 1391.078160 11128.625 0.0
> Constraint-Vir 12.719940 305.279 0.0
> Settle 376.112800 121484.434 0.2
> Virtual Site 3 21.012160 777.450 0.0
> Virtual Site 3fd 19.880736 1888.670 0.0
> Virtual Site 3fad 3.313456 583.168 0.0
> Virtual Site 3out 57.217728 4977.942 0.0
> Virtual Site 4fdn 16.486464 4187.562 0.0
>
> -----------------------------------------------------------------------------
> Total 57716385.269 100.0
>
> -----------------------------------------------------------------------------
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 1 MPI rank, each using 8 OpenMP threads
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
>
> -----------------------------------------------------------------------------
> Vsite constr. 1 8 10001 1.000 28.733 0.1
> Neighbor search 1 8 251 5.410 155.445 0.4
> Launch GPU ops. 1 8 10001 1353.401 38888.866 89.6
> Force 1 8 10001 6.941 199.443 0.5
> PME mesh 1 8 10001 121.314 3485.853 8.0
> Wait GPU NB local 1 8 10001 0.425 12.220 0.0
> NB X/F buffer ops. 1 8 19751 6.577 188.988 0.4
> Vsite spread 1 8 10102 1.572 45.169 0.1
> Write traj. 1 8 2 0.505 14.506 0.0
> Update 1 8 10001 5.537 159.111 0.4
> Constraints 1 8 10003 6.719 193.054 0.4
> Rest 0.691 19.863 0.0
>
> -----------------------------------------------------------------------------
> Total 1510.092 43391.251 100.0
>
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
>
> -----------------------------------------------------------------------------
> PME spread 1 8 10001 40.884 1174.769 2.7
> PME gather 1 8 10001 29.190 838.744 1.9
> PME 3D-FFT 1 8 20002 48.085 1381.693 3.2
> PME solve Elec 1 8 10001 3.060 87.920 0.2
>
> -----------------------------------------------------------------------------
>
> GPU timings
>
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
>
> -----------------------------------------------------------------------------
> Pair list H2D 251 0.377 1.500 0.0
> X / q H2D 10001 18.341 1.834 0.0
> Nonbonded F kernel 9900 55340232448.731 5589922469 50.0
> Nonbonded F+ene k. 101 62.366 617.484 0.0
> Pruning kernel 251 6.736 26.836 0.0
> F D2H 10001 55340232519.377 5533469904 50.0
>
> -----------------------------------------------------------------------------
> Total 110680465055.927 1106693981 100.0
>
> -----------------------------------------------------------------------------
> *Dynamic pruning 4750 25.872 5.447 0.0
>
> -----------------------------------------------------------------------------
>
> Average per-step force GPU/CPU evaluation time ratio: 11066939811.612 ms/12.824 ms = 862973673.837
> For optimal resource utilization this ratio should be close to 1
>
> NOTE: The GPU has >25% more load than the CPU. This imbalance wastes
> CPU resources.
>
> Core t (s) Wall t (s) (%)
> Time: 12080.732 1510.092 800.0
> (ns/day) (hour/ns)
> Performance: 2.861 8.389
>
> [ LOG FILE ENDS ]
>
> Cheers from an unbearably hot São Paulo,
> --
> Elton Carvalho
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.
--
Elton Carvalho