[gmx-users] NVIDIA GTX cards in Rackable servers, how do you do it ?

Tue Feb 24 16:02:04 CET 2015

On 24/02/2015 13:29, David McGiven wrote:
>> I never benchmarked 64-core AMD nodes with GPUs. With a 80 k atoms test
>> system using a 2 fs time step I get
>> 24 ns/d on 64 AMD   cores 6272
>> 16 ns/d on 32 AMD   cores 6380
>> 36 ns/d on 32 AMD   cores 6380   with 1x GTX 980
>> 40 ns/d on 32 AMD   cores 6380   with 2x GTX 980
>> 27 ns/d on 20 Intel cores 2680v2
>> 52 ns/d on 20 Intel cores 2680v2 with 1x GTX 980
>> 62 ns/d on 20 Intel cores 2680v2 with 2x GTX 980
> I think 20 Intel cores means 2 x 10 cores each.
>
> But Szilard just mentioned in this same thread :
>
>> If you can afford them get the 14/16 or 18 core v3 Haswells, those are
>> *really* fast, but a pair can cost as much as a decent car.
>
> I know for sure gromacs scales VERY well on 4 x 16-core machines with
> the latest AMD (Interlagos, Bulldozer, etc.) processors, but I have no
> experience with Intel Xeon.

My experience with the latest gromacs and fftw builds on my machine is
that one should not count the "hyperthreaded" cores, only the real
(physical) ones.

My system reports 24 "cores" (E5-2620 v2 @ 2.10GHz + NVIDIA K4000), but
only 12 of them are "real" cores.
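
For anyone who wants to check the same thing on their own node, here is a
minimal sketch of how the logical and physical core counts can be told
apart on Linux. It assumes the usual x86 /proc/cpuinfo layout with
"processor", "physical id" and "core id" fields; the function name and
the sample text are only illustrative.

```python
def count_cores(cpuinfo_text):
    """Return (logical, physical) core counts from /proc/cpuinfo-style text."""
    logical = 0
    physical = set()
    # Each logical CPU is described by one blank-line-separated stanza.
    for stanza in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        # Hyperthreads of one core share the same (physical id, core id) pair.
        physical.add((fields.get("physical id"), fields.get("core id")))
    return logical, len(physical)


# Made-up sample: one socket, two cores, two hardware threads per core.
SAMPLE = """\
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 1

processor   : 2
physical id : 0
core id     : 0

processor   : 3
physical id : 0
core id     : 1
"""

print(count_cores(SAMPLE))
# On a real Linux machine: count_cores(open("/proc/cpuinfo").read())
```

The second number is the one I would pass to -nt as a starting point.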

With pinning enabled and only one test system running at a time, I ran
the benchmarks available on the gromacs web site (ADH, rnase, villin,
http://www.gromacs.org/GPU_acceleration).

My results were:

*** rnase_cubic
45.75 ns/day with -nt  6 and gpu on
47.10 ns/day with -nt 12 and gpu on
27.66 ns/day with -nt 24 and gpu on
35.31 ns/day with -nt 12 and gpu off
21.37 ns/day with -nt 24 and gpu off

The results are more or less similar in the other benchmarks: 6 cores +
GPU is close to 12 cores + GPU, and both are faster than 24 "cores".
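
To put numbers on that, the ratios between the rnase_cubic runs can be
worked out directly. This is only arithmetic on the figures reported in
this message; the dictionary layout and the helper name are my own.

```python
# ns/day for rnase_cubic, keyed by (threads passed to -nt, GPU enabled).
rnase_cubic = {
    (6, True): 45.75,
    (12, True): 47.10,
    (24, True): 27.66,
    (12, False): 35.31,
    (24, False): 21.37,
}


def ratio(run, reference):
    """Throughput of `run` relative to `reference` (1.0 means equal)."""
    return rnase_cubic[run] / rnase_cubic[reference]


print(f" 6 vs 12 threads, GPU on : {ratio((6, True), (12, True)):.2f}")
print(f"24 vs 12 threads, GPU on : {ratio((24, True), (12, True)):.2f}")
print(f"24 vs 12 threads, GPU off: {ratio((24, False), (12, False)):.2f}")
print(f"GPU on vs off, 12 threads: {ratio((12, True), (12, False)):.2f}")
```

So going from 12 to 24 threads costs roughly 40 % of the throughput
with or without the GPU, while the GPU adds about a third at 12 threads.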

In the GPU case the difference shows up in the average GPU usage: it
stays above 85 % during the test runs when not all logical processors
are in use, but drops to about 50 % when all of them are in use (a
rough observation made with the nvidia-smi tool).
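
A rough way to record that figure instead of eyeballing it is sketched
below. It assumes nvidia-smi is on the PATH and uses its CSV query
output (--query-gpu=utilization.gpu); the function names, the sampling
interval and the number of samples are arbitrary choices of mine, not
what I used for the numbers above.

```python
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(output):
    """Turn nvidia-smi CSV output into one integer percentage per GPU."""
    return [int(line) for line in output.splitlines() if line.strip()]


def sample_utilization():
    """Query nvidia-smi once; needs an NVIDIA driver installation."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)


def average_first_gpu(samples=30, interval=2.0):
    """Average the utilization of GPU 0 over a number of samples."""
    values = []
    for _ in range(samples):
        values.append(sample_utilization()[0])
        time.sleep(interval)
    return sum(values) / len(values)


# Parsing example on made-up output for a two-GPU machine.
print(parse_utilization("87\n52\n"))
```

Calling average_first_gpu() in a second terminal while mdrun is running
gives a single number per run that can be compared between -nt settings.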

I have no explanation for the CPU-only results though, since I tried
with pinning both enabled and disabled, made sure that only one job was
running at a time, etc. I have not played much with the thread settings
beyond -nt (neither the OpenMP nor the MPI ones), since this machine is
a single node.
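
For anyone who wants to repeat the comparison, the sweep over thread
counts could be scripted along these lines. This is a sketch assuming a
GROMACS 5.0-style "gmx mdrun" binary and its -nt, -pin and -nb options;
the helper name and the topol.tpr file name are placeholders, and the
-ntmpi/-ntomp split that I did not explore is left out.

```python
def mdrun_command(nt, use_gpu, pin=True, tpr="topol.tpr"):
    """Build an mdrun command line for one benchmark configuration."""
    return [
        "gmx", "mdrun",
        "-s", tpr,                           # run input file (placeholder name)
        "-nt", str(nt),                      # total number of threads
        "-pin", "on" if pin else "off",      # thread pinning
        "-nb", "gpu" if use_gpu else "cpu",  # where non-bonded work runs
    ]


# The five configurations reported for rnase_cubic.
runs = [(6, True), (12, True), (24, True), (12, False), (24, False)]
for nt, use_gpu in runs:
    print(" ".join(mdrun_command(nt, use_gpu)))
```

Each printed line can be pasted into a shell, or the list passed to
subprocess.run, one job at a time.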

Hope this helps in showing that "more expensive" may not be the way...

Best,

Stéphane

-- 
Lecturer, UFIP, UMR 6286 CNRS, Team Protein Design In Silico
UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322 Nantes cedex 03, France
Tél : +33 251 125 636 / Fax : +33 251 125 632
http://www.ufip.univ-nantes.fr/ - http://www.steletch.org