[gmx-users] NVIDIA GTX cards in Rackable servers, how do you do it ?

Téletchéa Stéphane stephane.teletchea at univ-nantes.fr
Tue Feb 24 16:02:04 CET 2015

On 24/02/2015 13:29, David McGiven wrote:
>> I never benchmarked 64-core AMD nodes with GPUs. With a 80 k atoms test
>> system using a 2 fs time step I get
>> 24 ns/d on 64 AMD   cores 6272
>> 16 ns/d on 32 AMD   cores 6380
>> 36 ns/d on 32 AMD   cores 6380   with 1x GTX 980
>> 40 ns/d on 32 AMD   cores 6380   with 2x GTX 980
>> 27 ns/d on 20 Intel cores 2680v2
>> 52 ns/d on 20 Intel cores 2680v2 with 1x GTX 980
>> 62 ns/d on 20 Intel cores 2680v2 with 2x GTX 980
> I think 20 Intel cores means 2 x 10 cores each.
> But Szilard just mentioned in this same thread :
>> If you can afford them get the 14/16 or 18 core v3 Haswells, those are
>> *really* fast, but a pair can cost as much as a decent car.
> I know for sure gromacs scales VERY well on 4 x 16 cores latest AMD
> (Interlagos, Bulldozer, etc.) machines. But I have no experience with
> Intel Xeon.

My experience with the latest gromacs and fftw build on my machine is that
one should not count the "hyperthreaded" "cores", but only the real ones.

My system has 24 "cores" (E5-2620 v2 @ 2.10GHz + NVIDIA K4000), but 
really only 12 "real" cores.
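
For anyone who wants to check the logical versus physical core count on their own Linux node, here is a minimal sketch. It assumes the usual /proc/cpuinfo layout (one block per logical CPU, with "physical id" and "core id" fields); the function name count_cores and the sample text are made up for illustration and are not output from my machine.

```python
def count_cores(cpuinfo_text):
    """Return (logical, physical) core counts from /proc/cpuinfo-style text.

    Assumes one blank-line-separated block per logical CPU, each carrying
    'processor', 'physical id' and 'core id' fields (typical on x86 Linux).
    """
    logical = 0
    physical = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            logical += 1
            # Hyperthreads of one core share the same (package, core) pair.
            physical.add((fields.get("physical id"), fields.get("core id")))
    return logical, len(physical)


# Hypothetical sample: one package, 2 cores, 2 hardware threads per core.
SAMPLE = "\n\n".join(
    f"processor\t: {cpu}\nphysical id\t: 0\ncore id\t\t: {cpu % 2}"
    for cpu in range(4)
)

logical, physical = count_cores(SAMPLE)
print(f"{logical} logical CPUs, {physical} physical cores")
```

On a real node one would pass the contents of /proc/cpuinfo instead of SAMPLE. If the two numbers differ, hyperthreading is enabled.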

With thread pinning enabled, and running only one test system at a time
under optimized conditions, I used the benchmarks available at the
gromacs web site (ADH, rnase, villin, 

My results were :

*** rnase_cubic
45.75 ns/day with -nt  6 and gpu on
47.10 ns/day with -nt 12 and gpu on
27.66 ns/day with -nt 24 and gpu on
35.31 ns/day with -nt 12 and gpu off
21.37 ns/day with -nt 24 and gpu off

The results are more or less similar in the other benchmarks: 6 cores +
GPU comes close to 12 cores + GPU, and both are faster than 24 cores...
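
To put the rnase_cubic numbers above on a common scale, here is a small sketch that expresses each run relative to the CPU-only run with 12 threads. The choice of baseline and the helper name relative are my own; the figures are the ones listed above.

```python
# rnase_cubic throughput in ns/day, keyed by (non-bonded backend, -nt value).
results = {
    ("gpu", 6): 45.75,
    ("gpu", 12): 47.10,
    ("gpu", 24): 27.66,
    ("cpu", 12): 35.31,
    ("cpu", 24): 21.37,
}


def relative(results, baseline):
    """Return each throughput divided by the throughput of `baseline`."""
    base = results[baseline]
    return {key: value / base for key, value in results.items()}


rel = relative(results, ("cpu", 12))
for (mode, nt), ratio in sorted(rel.items()):
    print(f"{mode} -nt {nt:2d}: {ratio:.2f} x the CPU-only -nt 12 run")
```

With these figures the GPU run at -nt 12 comes out at about 1.33 times the CPU-only run at -nt 12, and going from -nt 12 to -nt 24 leaves roughly 0.6 of the throughput both with and without the GPU.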

The difference in the GPU case is the average GPU usage, which stays
above 85 % during the test runs when not all processors are in use, while
it drops to 50 % when all cores are in use (a rough observation of the
GPU usage with the nvidia-smi tool).
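
A less rough way to get the average GPU usage would be to sample nvidia-smi at regular intervals during a run and average the readings. The sketch below is only an illustration: the function names, the sampling interval and the sample readings are assumptions. It relies on the nvidia-smi options --query-gpu=utilization.gpu and --format=csv,noheader,nounits, which print one integer percentage per GPU.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"]


def mean_utilization(text):
    """Average the integer percentages found one per line in `text`."""
    values = [int(line) for line in text.splitlines() if line.strip()]
    if not values:
        raise ValueError("no utilization readings found")
    return sum(values) / len(values)


def sample_gpu(n_samples=10, interval=1.0):
    """Poll nvidia-smi n_samples times and return the mean utilization.

    Needs an NVIDIA driver installation; it is defined here but not called.
    """
    readings = []
    for _ in range(n_samples):
        out = subprocess.run(QUERY, capture_output=True, text=True,
                             check=True).stdout
        readings.append(mean_utilization(out))
        time.sleep(interval)
    return sum(readings) / len(readings)


# Hypothetical readings, not measured values.
print(mean_utilization("85\n90\n80\n"))
```

Calling sample_gpu() in a second terminal while mdrun is running would give a number comparable to the percentages quoted above.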

I have no explanation for the CPU-only benchmark though, since I tried
with pinning both enabled and disabled, ensured that only one job was
running at a time, etc. I have not experimented much with -nt or with the
OpenMP/MPI thread settings, since this machine is a single node.
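
For a more systematic scan of thread counts, one could generate the mdrun command lines instead of typing them by hand. This is a sketch under assumptions, not the exact commands I ran: it assumes a GROMACS 5.x style "gmx mdrun" binary, the file name topol.tpr and the bench_* output prefixes are placeholders, and the function name is made up. The mdrun options used (-s, -nt, -nb, -pin, -deffnm) do exist.

```python
import itertools


def mdrun_commands(tpr, thread_counts, backends=("gpu", "cpu"), pin="on"):
    """Build one mdrun argument list per (backend, thread count) pair."""
    commands = []
    for nb, nt in itertools.product(backends, thread_counts):
        commands.append([
            "gmx", "mdrun",
            "-s", tpr,            # run input file (placeholder name)
            "-nt", str(nt),       # total number of threads
            "-nb", nb,            # non-bonded interactions on gpu or cpu
            "-pin", pin,          # thread pinning: on, off or auto
            "-deffnm", f"bench_{nb}_nt{nt}",  # prefix for output files
        ])
    return commands


for cmd in mdrun_commands("topol.tpr", [6, 12, 24]):
    print(" ".join(cmd))
```

Each list can be handed to subprocess.run one at a time, so that only one job is running on the node at any moment.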

Hope this helps in showing that "more expensive" may not be the way...



Lecturer, UFIP, UMR 6286 CNRS, Team Protein Design In Silico
UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322 Nantes cedex 03, France
Tél : +33 251 125 636 / Fax : +33 251 125 632
http://www.ufip.univ-nantes.fr/ - http://www.steletch.org
