[gmx-users] Gromacs 4.5.1 on 48 core magny-cours AMDs

Fri Oct 22 13:43:01 CEST 2010

Hi Carsten,

I've been thinking a bit about this issue, and for now a relatively easy fix would be to enable thread affinity when all cores on a machine are used. When fewer threads are turned on, I don't want to turn on thread affinity because any combination might either
- interfere with other running mdruns
- cause mdrun to run sub-optimally by forcing it, for example, to run two threads on the same core with hyperthreading.
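
A minimal sketch of that policy, in Python rather than mdrun's own C, and not the actual mdrun implementation: threads are pinned one per core only when the thread count equals the number of available cores, and otherwise placement is left to the scheduler. The function names (`should_pin`, `affinity_plan`, `pin_calling_thread`) are hypothetical; `os.sched_setaffinity` is the Linux-only call in Python's standard library.

```python
import os


def should_pin(n_threads, n_cores):
    """Pin only when every core on the machine is in use."""
    return n_threads == n_cores


def affinity_plan(n_threads, cores):
    """Map thread index -> core id, or return {} when pinning is skipped."""
    cores = sorted(cores)
    if not should_pin(n_threads, len(cores)):
        # Fewer (or more) threads than cores: leave placement to the OS,
        # so other jobs and hyperthread siblings are not disturbed.
        return {}
    return {thread: core for thread, core in enumerate(cores)}


def pin_calling_thread(core):
    """Restrict the calling thread to one core (Linux only)."""
    if hasattr(os, "sched_setaffinity"):
        # On Linux, pid 0 refers to the calling thread.
        os.sched_setaffinity(0, {core})
```

Each worker thread would look up its own entry in `affinity_plan(n_threads, os.sched_getaffinity(0))` and, if the plan is not empty, call `pin_calling_thread` with it. The sketch deliberately ignores which cores share a physical core or a NUMA node; that topology question is the part hwloc would have to answer.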

The second issue *could* be solved, but would require some work that I personally feel would be the domain of the operating system. I'm looking into using hwloc right now, but that doesn't appear to have cmake support.

It appears that relatively recent kernels are pretty good at distributing jobs; do you know which kernel version and distribution gave you the unreliable performance numbers you e-mailed?
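
To put a number on how unreliable those timings are, here is a small sketch that summarises the ns/day column of the g_tune_pme snippet quoted further down in this thread. The values are copied from that table; the `spread` helper and the dictionary layout are only illustrative.

```python
from statistics import mean

# ns/day per repeat, keyed by the number of PME-only nodes.
# Key -1 stands for the "-1( 18)" rows of the g_tune_pme output.
ns_per_day = {
    24: [8.736, 8.730, 12.505, 8.064],
    20: [8.992, 7.958, 11.721, 14.287],
    16: [8.404, 8.551, 8.972, 8.833],
    0: [8.520, 8.427, 7.974, 8.534],
    -1: [8.182, 13.184, 8.677, 12.931],
}


def spread(values):
    """Return (mean, min, max, max/min ratio) for one set of repeats."""
    lo, hi = min(values), max(values)
    return mean(values), lo, hi, hi / lo


for pme_nodes, values in ns_per_day.items():
    avg, lo, hi, ratio = spread(values)
    print(f"{pme_nodes:>3}  mean {avg:6.3f}  min {lo:6.3f}  "
          f"max {hi:6.3f}  max/min {ratio:4.2f}")
```

With 20 PME-only nodes the fastest repeat is roughly 1.8 times faster than the slowest one, while the four repeats with 16 PME-only nodes stay within about 7% of each other.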

Sander

On 21 Oct 2010, at 14:04 , Carsten Kutzner wrote:

> Hi Sander,
> 
> On Oct 21, 2010, at 12:27 PM, Sander Pronk wrote:
> 
>> Hi Carsten,
>> 
>> As Berk noted, we haven't had problems on 24-core machines, but quite frankly I haven't looked at thread migration. 
> I did not have any problems on 32-core machines either, only on 48-core ones.
>> 
>> Currently, the wait states actively yield to the scheduler, which is an opportunity for the scheduler to re-assign threads to different cores. I could set harder thread affinity but that could compromise system responsiveness (when running mdrun on a desktop machine without active yielding, the system slows down noticeably). 
>> 
>> One thing you could try is to turn on the THREAD_MPI_WAIT_FOR_NO_ONE option in cmake. That turns off the yielding which might change the migration behavior.
> I will try that, thanks!
>> 
>> BTW, what do you mean by bad performance, and how do you notice thread migration issues?
> A while ago I benchmarked a ~80,000 atom test system (membrane+channel+water, 2 fs time 
> step, cutoffs @ 1 nm) on a 48-core 1.9 GHz AMD node. My first try gave a lousy 7.5 ns/day 
> using Gromacs 4.0.7 and IntelMPI. According to AMD, parallel applications should be
> run under control of numactl to be compliant with the new memory hierarchy. Also, they
> suggest using OpenMPI rather than other MPI libs. With OpenMPI and numactl - which pins
> the processes to the cores - the performance was nearly doubled to 14.3 ns/day. Using 
> Gromacs 4.5 I got 14.0 ns/day with OpenMPI+numactl and 15.2 ns/day with threads (here no 
> pinning was necessary for the threaded version!)
> 
> Now on another machine with identical hardware (but another Linux) I get 4.5.1 timings that 
> vary a lot (see g_tune_pme snippet below) even between identical runs. One run actually approaches 
> the expected 15 ns/day, while the others (also with 20 PME-only nodes) do not. I cannot be sure
> that thread migration is the problem here, but correct pinning might be necessary here. 
> 
> Carsten
> 
> 
> 
> g_tune_pme output snippet for mdrun with threads:
> -------------------------------------------------
> Benchmark steps         : 1000
> dlb equilibration steps : 100
> Repeats for each test   : 4
> 
> No.   scaling  rcoulomb  nkx  nky  nkz   spacing      rvdw  tpr file
>   0   -input-  1.000000   90   88   80  0.119865   1.000000  ./Aquaporin_gmx4_bench00.tpr
> 
> Individual timings for input file 0 (./Aquaporin_gmx4_bench00.tpr):
> PME nodes      Gcycles       ns/day        PME/f    Remark
>  24          1804.442        8.736        1.703    OK.
>  24          1805.655        8.730        1.689    OK.
>  24          1260.351       12.505        0.647    OK.
>  24          1954.314        8.064        1.488    OK.
>  20          1753.386        8.992        1.960    OK.
>  20          1981.032        7.958        2.190    OK.
>  20          1344.375       11.721        1.180    OK.
>  20          1103.340       14.287        0.896    OK.
>  16          1876.134        8.404        1.713    OK.
>  16          1844.111        8.551        1.525    OK.
>  16          1757.414        8.972        1.845    OK.
>  16          1785.050        8.833        1.208    OK.
>   0          1851.645        8.520          -      OK.
>   0          1871.955        8.427          -      OK.
>   0          1978.357        7.974          -      OK.
>   0          1848.515        8.534          -      OK.
>  -1( 18)     1926.202        8.182        1.453    OK.
>  -1( 18)     1195.456       13.184        0.826    OK.
>  -1( 18)     1816.765        8.677        1.853    OK.
>  -1( 18)     1218.834       12.931        0.884    OK.
> 
> 
> 
>> Sander
>> 
>> On 21 Oct 2010, at 12:03 , Carsten Kutzner wrote:
>> 
>>> Hi,
>>> 
>>> does anyone have experience with AMD's 12-core Magny-Cours
>>> processors? With 48 cores on a node it is essential that the processes
>>> are properly pinned to the cores for optimum performance.  Numactl
>>> can do this, but at the moment I do not get good performance with
>>> 4.5.1 and threads, which still seem to be migrating around.
>>> 
>>> Carsten
>>> 
>>> 
> 