[gmx-users] Gromacs 4.5.1 on 48-core Magny-Cours AMDs
Carsten Kutzner
ckutzne at gwdg.de
Thu Oct 21 14:04:05 CEST 2010
Hi Sander,
On Oct 21, 2010, at 12:27 PM, Sander Pronk wrote:
> Hi Carsten,
>
> As Berk noted, we haven't had problems on 24-core machines, but quite frankly I haven't looked at thread migration.
I did not have any problems on 32-core machines either, only on 48-core ones.
>
> Currently, the wait states actively yield to the scheduler, which is an opportunity for the scheduler to re-assign threads to different cores. I could set harder thread affinity but that could compromise system responsiveness (when running mdrun on a desktop machine without active yielding, the system slows down noticeably).
>
> One thing you could try is to turn on the THREAD_MPI_WAIT_FOR_NO_ONE option in cmake. That turns off the yielding which might change the migration behavior.
I will try that, thanks!
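If I read the CMake setup correctly that is just a cache variable, so I would try a
build along these lines (untested so far, so only a sketch):
  cmake .. -DTHREAD_MPI_WAIT_FOR_NO_ONE=ON
  make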
>
> BTW, what do you mean by bad performance, and how do you notice thread migration issues?
A while ago I benchmarked a ~80,000 atom test system (membrane + channel + water, 2 fs time
step, cutoffs at 1 nm) on a 48-core 1.9 GHz AMD node. My first try gave a lousy 7.5 ns/day
using Gromacs 4.0.7 and IntelMPI. According to AMD, parallel applications should be run
under the control of numactl to match the new memory hierarchy, and they also suggest
using OpenMPI rather than other MPI libraries. With OpenMPI and numactl - which pins the
processes to the cores - the performance nearly doubled to 14.3 ns/day. With Gromacs 4.5
I got 14.0 ns/day with OpenMPI+numactl and 15.2 ns/day with threads (no extra pinning was
necessary for the threaded version!).
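(In case it helps others: one way to do the numactl pinning with OpenMPI is a small wrapper
script around mdrun - the script name, the simple rank-to-core mapping and the
OMPI_COMM_WORLD_RANK variable are only an example and assume a recent OpenMPI:
  #!/bin/sh
  # numawrap.sh: bind each MPI rank to one core and to its local memory
  numactl --physcpubind=$OMPI_COMM_WORLD_RANK --localalloc mdrun_mpi "$@"
which is then launched with something like
  mpirun -np 48 ./numawrap.sh -s bench.tpr )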
Now on another machine with identical hardware (but a different Linux) the 4.5.1 timings
vary a lot (see the g_tune_pme snippet below), even between identical runs. One run actually
approaches the expected 15 ns/day, while the others (also with 20 PME-only nodes) do not.
I cannot be sure that thread migration is the problem here, but correct pinning might well
be necessary.
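As a quick check I will also try restricting a threaded run to a single NUMA node, e.g.
(core and node numbering are machine-specific, see numactl --hardware):
  numactl --cpunodebind=0 --membind=0 mdrun -nt 12 -s bench.tpr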
Carsten
g_tune_pme output snippet for mdrun with threads:
-------------------------------------------------
Benchmark steps : 1000
dlb equilibration steps : 100
Repeats for each test : 4
No. scaling rcoulomb nkx nky nkz spacing rvdw tpr file
0 -input- 1.000000 90 88 80 0.119865 1.000000 ./Aquaporin_gmx4_bench00.tpr
Individual timings for input file 0 (./Aquaporin_gmx4_bench00.tpr):
PME nodes Gcycles ns/day PME/f Remark
24 1804.442 8.736 1.703 OK.
24 1805.655 8.730 1.689 OK.
24 1260.351 12.505 0.647 OK.
24 1954.314 8.064 1.488 OK.
20 1753.386 8.992 1.960 OK.
20 1981.032 7.958 2.190 OK.
20 1344.375 11.721 1.180 OK.
20 1103.340 14.287 0.896 OK.
16 1876.134 8.404 1.713 OK.
16 1844.111 8.551 1.525 OK.
16 1757.414 8.972 1.845 OK.
16 1785.050 8.833 1.208 OK.
0 1851.645 8.520 - OK.
0 1871.955 8.427 - OK.
0 1978.357 7.974 - OK.
0 1848.515 8.534 - OK.
-1( 18) 1926.202 8.182 1.453 OK.
-1( 18) 1195.456 13.184 0.826 OK.
-1( 18) 1816.765 8.677 1.853 OK.
-1( 18) 1218.834 12.931 0.884 OK.
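(The timings above came from a g_tune_pme call roughly along these lines - options written
down from memory, so please check g_tune_pme -h:
  g_tune_pme -nt 48 -r 4 -steps 1000 -resetstep 100 -s ./Aquaporin_gmx4_bench00.tpr )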
> Sander
>
> On 21 Oct 2010, at 12:03 , Carsten Kutzner wrote:
>
>> Hi,
>>
>> does anyone have experience with AMD's 12-core Magny-Cours
>> processors? With 48 cores on a node it is essential that the processes
>> are properly pinned to the cores for optimum performance. Numactl
>> can do this, but at the moment I do not get good performance with
>> 4.5.1 and threads, which still seem to be migrating around.
>>
>> Carsten
>>
>>