[gmx-users] NVIDIA GTX cards in Rackable servers, how do you do it ?
pall.szilard at gmail.com
Tue Feb 24 17:19:01 CET 2015
On Tue, Feb 24, 2015 at 3:44 PM, Téletchéa Stéphane
<stephane.teletchea at univ-nantes.fr> wrote:
> On 24/02/2015 13:29, David McGiven wrote:
>>> I never benchmarked 64-core AMD nodes with GPUs. With an 80 k atom test
>>> system using a 2 fs time step I get:
>>> 24 ns/d on 64 AMD cores 6272
>>> 16 ns/d on 32 AMD cores 6380
>>> 36 ns/d on 32 AMD cores 6380 with 1x GTX 980
>>> 40 ns/d on 32 AMD cores 6380 with 2x GTX 980
>>> 27 ns/d on 20 Intel cores 2680v2
>>> 52 ns/d on 20 Intel cores 2680v2 with 1x GTX 980
>>> 62 ns/d on 20 Intel cores 2680v2 with 2x GTX 980
>> I think 20 Intel cores means 2 x 10 cores each.
>> But Szilard just mentioned in this same thread:
>>> If you can afford them get the 14/16 or 18 core v3 Haswells, those are
>>> *really* fast, but a pair can cost as much as a decent car.
>> I know for sure gromacs scales VERY well on 4 x 16 cores on the latest AMD
>> (Interlagos, Bulldozer, etc.) machines. But I have no experience with Intel.
> My experience with the latest gromacs and fftw built on my machine is that
> one should not consider the "hyperthreaded" "cores", but only the real ones.
> My system has 24 "cores" (E5-2620 v2 @ 2.10GHz + NVIDIA K4000), but really
> only 12 "real" cores.
> Using pinning, and running only one test system at a time under optimized
> conditions, I used the benchmark systems available at the gromacs web site
> (ADH, rnase, villin, ...).
> My results were:
> *** rnase_cubic
> 45.75 ns/day with -nt 6 and gpu on
> 47.10 ns/day with -nt 12 and gpu on
> 27.66 ns/day with -nt 24 and gpu on
> 35.31 ns/day with -nt 12 and gpu off
> 21.37 ns/day with -nt 24 and gpu off
> The results are more or less similar in the other benchmarks: 6 cores + GPU
> is close to 12 cores + GPU, and both are faster than 24 cores...
> The difference in the GPU case is the average GPU usage, which is more than
> 85 % during the test runs when not all processors are in use, while it drops
> to 50 % if all cores are in use (based on a rough observation of GPU usage
> with the nvidia-smi tool).
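For anyone wanting to reproduce this kind of observation, a polling command along these lines can be used. This is only a sketch: the `--query-gpu` fields are standard nvidia-smi options, but since actually running it requires an NVIDIA driver, the script just prints the command.

```shell
#!/bin/sh
# Sketch: build (and print, rather than run) an nvidia-smi polling command
# that reports GPU utilization and memory use once per second during a run.
# Running it for real requires an NVIDIA driver, so we only echo it here.
CMD="nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1"
echo "$CMD"
```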
> I have no explanation for the CPU-only benchmarks though, since I have
> tried with pinning enabled and disabled, ensured that only one job was
> running at a time, etc. I have not played much with -nt, either OpenMP or
> MPI, since this machine is a single node.
> Hope this helps in showing that "more expensive" may not be the way...
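A sweep like the one described in the quoted message can be scripted. The sketch below only prints the command lines rather than running them; the `-nt`/`-nb`/`-s` flags follow GROMACS 4.6/5.x mdrun syntax, and "rnase_cubic.tpr" is a placeholder for whichever benchmark input is being tested.

```shell
#!/bin/sh
# Sketch: generate (print, not run) the mdrun command lines for the
# thread-count / GPU on-off sweep described above. "rnase_cubic.tpr"
# is a placeholder name for the benchmark .tpr file.
for nt in 6 12 24; do
    for nb in gpu cpu; do   # -nb selects where the nonbondeds run
        echo "mdrun -nt $nt -nb $nb -pin on -s rnase_cubic.tpr"
    done
done
```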
Thanks! Let me note that those observations are particular to your
machine. There are multiple factors that cumulatively affect the
scaling behavior:
- physical vs "HT" threads
- crossing socket boundaries
- iteration time/data per thread
- relative CPU and GPU performance
In your case all of these factors are somewhat disadvantageous for
good scaling. You have two sockets, so your runs are crossing CPU
socket boundaries. The input is quite small, and with GPUs the
HyperThreading disadvantages can increase - especially with a slow GPU.
- your Quadro 4000 can likely not keep up with the 12 CPU cores, and
there is probably some "Wait GPU" time (see the log file)
- if you want to compare 1 CPU + 1 GPU with HT against the same setup
without HT, you should make sure to run with "-pinstride 1 -ntomp 12".
- "-nt" is a partially deprecated, backward-compatibility flag and should
only be used if its meaning is "use this many tMPI or OpenMP threads
and decide which one is better", which is not the case here!
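The pinning advice above can be illustrated with a few concrete invocations. This is a sketch that only prints the commands; the thread counts assume the 12-core/24-thread E5-2620 v2 machine discussed above, and the binary may be "mdrun" or "gmx mdrun" depending on the GROMACS version installed.

```shell
#!/bin/sh
# Sketch: print example mdrun invocations for the HT comparison.
# GROMACS pins threads in hardware-topology order, where hyperthread
# siblings are adjacent, so a pin stride of 2 skips the siblings.
# One thread per physical core, HT siblings left idle:
echo "mdrun -ntomp 12 -pin on -pinstride 2"
# 12 threads packed onto the first 12 hardware threads:
echo "mdrun -ntomp 12 -pin on -pinstride 1"
# All 24 hardware threads in use:
echo "mdrun -ntomp 24 -pin on -pinstride 1"
```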
> Lecturer, UFIP, UMR 6286 CNRS, Team Protein Design In Silico
> UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322 Nantes
> cedex 03, France
> Tél : +33 251 125 636 / Fax : +33 251 125 632
> http://www.ufip.univ-nantes.fr/ - http://www.steletch.org