[gmx-developers] GPU acceleration on 48 core nodes

Szilárd Páll szilard.pall at cbr.su.se
Thu May 24 16:26:24 CEST 2012


Hi,

On Thu, May 24, 2012 at 3:52 PM, Berk Hess <hess at kth.se> wrote:
> Hi,
>
> Currently you shouldn't use the HT cores as the thread pinning locks to the
> wrong cores.

Unless you were already doing it the right way: disabling mdrun's
internal pinning and pinning manually through an mdrun wrapper script
with MPI.

For now only the non-bonded calculation is offloaded to the GPU, so
GPU acceleration relies strongly on the balance between CPU cores and
GPUs. The sweet spot is somewhere around 4-8 cores/GPU on Intel and
6-16 cores/GPU on AMD. As Berk mentioned, on Kepler we are getting a
pretty good speedup compared to Fermi, so the balance will shift
slightly.

A few more things to note:
- if you have too many CPU cores per GPU, the CPU will be "too fast",
finishing the PME and bonded calculations early and then waiting for
the GPU (you can see this wait time reported in the log);
- we have automated CPU-GPU load balancing with PME, which is limited
by the minimum cut-off set in the tpr file;
- with large (1.5+ nm) cut-offs the time saved on the CPU side
diminishes while the non-bonded workload on the GPU keeps growing;
- you should avoid, as much as possible, running OpenMP across
multiple NUMA regions, that is across multiple sockets on Intel or
across the two halves of an Interlagos processor (an Interlagos
package consists of two dies connected by HyperTransport).

I could continue with the list of dos and don'ts, but as we are still
in the polishing phase, let's leave that for after the beta release.

Cheers,
--
Szilárd


> The question about the AMD 48-core node is not specific enough.
> Performance does not depend on the number of cores alone, but on the
> core/GPU ratio and the PCIe connection(s).
> If you have only one GPU, I would suggest using only one CPU (socket)
> with it on the AMD node.
>
> We have new Kepler kernels in the making which should give at least a
> factor of 1.5 improvement.
>
> Cheers,
>
> Berk
>
>
> On 05/24/2012 03:11 PM, Carsten Kutzner wrote:
>>
>> Hi,
>>
>> does anyone have experience with running the GPU-accelerated Gromacs
>> on AMD 48-core nodes? Are there known performance issues?
>>
>> We are able to get good performance with a GTX680 when it is plugged
>> in an Intel node (12 cores), but not when we plug it in an Interlagos
>> (48 cores).
>>
>> With an 80k atom test system, PME 96x96x80, cutoffs @ 1 nm,
>> GMX_NSTLIST=20-80,
>> we get (I list results for the best performing NSTLIST value only,
>> typically 40)
>> ~  8.5 ns/day on the Intel node with HT
>> ~ 13.5 ns/day on the Intel node with the GPU (with or without HT, similar
>> performance)
>> ~ 14.0 ns/day on the Intel node with TWO GPUs (with HT)
>> ~ 17.0 ns/day on the Intel node with two GPUs _without HT_
>> ~ 15.0 ns/day on the AMD Interlagos using 48 cores and no GPU
>> ~ 10.0 ns/day on the AMD Interlagos using 48 cores and the GTX680.
>>
>> Is it possible to get a speedup from the GPU with Interlagos nodes?
>>
>> Thanks for any hints,
>>   Carsten
>>
>>
>> --
>> Dr. Carsten Kutzner
>> Max Planck Institute for Biophysical Chemistry
>> Theoretical and Computational Biophysics
>> Am Fassberg 11, 37077 Goettingen, Germany
>> Tel. +49-551-2012313, Fax: +49-551-2012302
>> http://www.mpibpc.mpg.de/home/grubmueller/ihp/ckutzne
>>
>>
>>
>>
>
> --
> gmx-developers mailing list
> gmx-developers at gromacs.org
> http://lists.gromacs.org/mailman/listinfo/gmx-developers
> Please don't post (un)subscribe requests to the list. Use the www interface
> or send it to gmx-developers-request at gromacs.org.


