[gmx-developers] RE: Gromacs on 48 core magny-cours AMDs

Alexey Shvetsov alexxy at omrb.pnpi.spb.ru
Sat Sep 17 21:44:23 CEST 2011


Hi all!
On Sat, 17 Sep 2011 12:04:41 -0700, Igor Leontyev wrote:
> The problem with unstable gromacs performance still exists, but there
> is some progress:
> 1) The MPI version is unstable in runs using 48 cores per node, but
> STABLE when using fewer than 48 cores/node.
> 2) The MPI version runs well even on 180 cores distributed as 45 per
> each of 4 nodes.
> 3) The threaded version has no problems on 48-core runs.
Yes, there will be no problems with the threaded version, since it does
not use the IB HBA.
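
For reference, a minimal sketch of a thread-MPI run on one node (the tpr
name is taken from your g_tune_pme log; -deffnm just sets an example
output prefix):

  # thread-MPI build of mdrun: 48 threads on one node, no Open MPI / IB involved
  mdrun -nt 48 -s cco_PM_ff03_sorin_scaled_meanpol.tpr -deffnm bench_threads
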
>
> - the cluster configuration is typical (not NUMA);
You are wrong here. Each of your 48-core machines has 4 NUMA nodes, one
per CPU.
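
You can check the topology directly on the node, assuming numactl and
hwloc are installed:

  numactl --hardware   # lists the NUMA nodes with their CPUs and memory
  lstopo               # hwloc view of sockets, NUMA nodes and caches
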
> - software: Gromacs 4.5.4; compiled by gcc 4.4.6; CentOS 5.6, kernel
> 2.6.18-238.19.1.el5.
This kernel is too old to work correctly with such new hardware, so you
need a newer kernel, say something like >= 2.6.38.
> - the compilation used default math libraries and OpenMPI 1.4.3
> supporting InfiniBand
Did you use CPU binding and hwloc for MPI process binding?
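
With Open MPI 1.4.x that would be something along these lines (an
untested sketch; binding flags differ between Open MPI versions, and the
binary and tpr names are simply taken from your g_tune_pme log):

  # bind one MPI rank per core and print the resulting bindings
  mpirun -np 48 --bind-to-core --report-bindings \
      mdrun_mpich1.4.3 -s cco_PM_ff03_sorin_scaled_meanpol.tpr
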
>
> Any idea why the use of all 48 cores/node results in unstable
> performance?
It's actually very simple. I have similar hardware, and with old kernels
the problems were the following:
1. They don't have a good scheduler for NUMA systems; kernels >= 2.6.36
have a good NUMA-aware scheduler.
2. Interrupts from the IB HBA may flood one of the CPUs, which is why
you will have problems with old kernels. In kernels >= 2.6.38 this is
solved.
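
A quick way to see whether this is biting you is to look at the per-CPU
interrupt counters, and, as a stopgap on the old kernel, spread the IB
IRQ by hand (the driver names and the IRQ number below are only
examples, adjust them to your node):

  # look for one CPU accumulating nearly all of the IB interrupts
  grep -i -e mlx -e mthca /proc/interrupts
  # as root: allow e.g. IRQ 98 on CPUs 0-7 (hex CPU mask); note that a
  # running irqbalance daemon may overwrite this setting
  echo ff > /proc/irq/98/smp_affinity
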
>
> Igor Leontyev
>
>> Igor wrote:
>> The issue might be related to the configuration of our brand new
>> cluster, which I am testing now. On this cluster the unstable behavior
>> of gromacs is also observed on Intel Xeon nodes. For the gromacs
>> installation I repeated all the steps that I have previously done many
>> times on an 8-core dual-Xeon workstation without problems. See below
>> the compilation script.
>>
>> # =====================================================================
>> #
>> # path where to install
>> pth_install=/home/leontyev/programs/bin/gromacs/gromacs-4.5.4
>> # program name suffix
>> suff="_mpich1.4.3"
>> # path of FFTW library
>> # SINGLE PRECISION
>> pth_fft=/home/leontyev/programs/bin/fftw/fftw-3.2.2/single
>> # path of 'open_mpi' library
>> pth_lam=/home/leontyev/programs/bin/mpi/openmpi/openmpi-1.4.3
>> export LD_LIBRARY_PATH="$pth_lam/lib"
>>
>> PATH="$pth_lam/bin:$PATH"
>>
>> export CPPFLAGS="-I$pth_fft/include -I$pth_lam/include"
>> export LDFLAGS="-L$pth_fft/lib -L$pth_lam/lib"
>>
>> make distclean
>> # SINGLE PRECISION
>> ./configure --without-x --prefix=$pth_install --program-suffix=$suff --enable-mpi
>>
>> make -j 12 mdrun >& install.log
>> make install-mdrun >> install.log
>> # =====================================================================
>>
>> Igor
>>
>>
>>> Alexey Shvetsov wrote:
>>>
>>> Hello!
>>>
>>> Well, there may be problems with:
>>> 1. An old kernel that works incorrectly with large NUMA systems
>>> 2. No correct process binding to cores
>>> 3. The configuration of gcc/math libs
>>>
>>> What are your MPI version and the versions of the FFTW and BLAS libs,
>>> if you use external ones?
>>> Also please post your CFLAGS.
>>>
>>> Here we have good performance on such nodes running SLES with a
>>> 2.6.32 kernel (with gentoo-prefix on top of it, with openmpi and the
>>> ofed stack), and with Gentoo (kernel 3.0.4) with many system
>>> optimizations made by me =)
>>>
>>> All results are stable. Gentoo works better here because it doesn't
>>> have the IRQ bug in the kernel, plus some optimizations.
>
>
>> On Sep 1, 2011, at 9:19 AM, Sander Pronk wrote:
>>
>>>
>>> On 31 Aug 2011, at 22:10 , Igor Leontyev wrote:
>>>
>>>> Hi
>>>> I am benchmarking a 100K atom system (protein ~12K and solvent
>>>> ~90K atoms, 1 fs time step, cutoffs 1.2 nm) on a 48-core 2.1 GHz AMD
>>>> node. Software: Gromacs 4.5.4; compiled by gcc 4.4.6; CentOS 5.6,
>>>> kernel 2.6.18-238.19.1.el5. See the results of g_tune_pme below.
>>>> The performance is absolutely unstable; the computation time for
>>>> equivalent runs can differ by orders of magnitude.
>>>>
>>>> The issue seems to be similar to what has been discussed earlier
>>>> 
>>>> http://lists.gromacs.org/pipermail/gmx-users/2010-October/055113.html
>>>> Is there any progress in resolving it?
>>>
>>> That's an old kernel. If I remember correctly, that thread 
>>> discussed issues related to thread&process affinity and 
>>> NUMA-awareness on older kernels.
>>>
>>> Perhaps you could try a newer kernel?
>>
>> Hi,
>>
>> we are running a slightly older kernel and get nice performance on 
>> our 48-core magny-cours.
>> Maybe with mpich the processes are not being pinned to the cores
>> correctly.
>>
>> Could you try the threaded version of mdrun? This is what gives the 
>> best (and reliable)
>> performance in our case.
>>
>> Carsten
>>
>>
>>>
>>>
>>>>
>>>> Igor
>>>>
>>>>
>>>> ------------------------------------------------------------
>>>>
>>>>    P E R F O R M A N C E   R E S U L T S
>>>>
>>>> ------------------------------------------------------------
>>>> g_tune_pme for Gromacs VERSION 4.5.4
>>>> Number of nodes         : 48
>>>> The mpirun command is   : 
>>>> /home/leontyev/programs/bin/mpi/openmpi/openmpi-1.4.3/bin/mpirun 
>>>> --hostfile node_loading.txt
>>>> Passing # of nodes via  : -np
>>>> The mdrun  command is   : 
>>>> /home/leontyev/programs/bin/gromacs/gromacs-4.5.4/bin/mdrun_mpich1.4.3
>>>> mdrun args benchmarks   : -resetstep 100 -o bench.trr -x bench.xtc 
>>>> -cpo bench.cpt -c bench.gro -e bench.edr -g bench.log
>>>> Benchmark steps         : 1000
>>>> dlb equilibration steps : 100
>>>> Repeats for each test   : 10
>>>> Input file              : cco_PM_ff03_sorin_scaled_meanpol.tpr
>>>> Coulomb type         : PME
>>>> Grid spacing x y z   : 0.114376 0.116700 0.116215
>>>> Van der Waals type   : Cut-off
>>>>
>>>> Will try these real/reciprocal workload settings:
>>>> No.  scaling  rcoulomb  nkx  nky  nkz   spacing      rvdw  tpr file
>>>>   0  -input-  1.200000   72   80  112  0.116700  1.200000  cco_PM_ff03_sorin_scaled_meanpol_bench00.tpr
>>>>
>>>> Individual timings for input file 0 
>>>> (cco_PM_ff03_sorin_scaled_meanpol_bench00.tpr):
>>>> PME nodes      Gcycles       ns/day        PME/f    Remark
>>>> 24          3185.840        2.734        0.538    OK.
>>>> 24          7237.416        1.203        1.119    OK.
>>>> 24          3225.448        2.700        0.546    OK.
>>>> 24          5844.942        1.489        1.012    OK.
>>>> 24          4013.986        2.169        0.552    OK.
>>>> 24         18578.174        0.469        0.842    OK.
>>>> 24          3234.702        2.692        0.559    OK.
>>>> 24         25818.267        0.337        0.815    OK.
>>>> 24         32470.278        0.268        0.479    OK.
>>>> 24          3234.806        2.692        0.561    OK.
>>>> 23         15097.577        0.577        0.824    OK.
>>>> 23          2948.211        2.954        0.705    OK.
>>>> 23         15640.485        0.557        0.826    OK.
>>>> 23         66961.240        0.130        3.215    OK.
>>>> 23          2964.927        2.938        0.698    OK.
>>>> 23          2965.896        2.937        0.669    OK.
>>>> 23         11205.121        0.774        0.668    OK.
>>>> 23          2964.737        2.938        0.672    OK.
>>>> 23         13384.753        0.649        0.665    OK.
>>>> 23          3738.425        2.329        0.738    OK.
>>>> 22          3130.744        2.782        0.682    OK.
>>>> 22          3981.770        2.187        0.659    OK.
>>>> 22          6397.259        1.350        0.666    OK.
>>>> 22         41374.579        0.211        3.509    OK.
>>>> 22          3193.327        2.728        0.683    OK.
>>>> 22         21405.007        0.407        0.871    OK.
>>>> 22          3543.511        2.457        0.686    OK.
>>>> 22          3539.981        2.460        0.701    OK.
>>>> 22         30946.123        0.281        1.235    OK.
>>>> 22         18031.023        0.483        0.729    OK.
>>>> 21          2978.520        2.924        0.699    OK.
>>>> 21          4487.921        1.940        0.666    OK.
>>>> 21         39796.932        0.219        1.085    OK.
>>>> 21          3027.659        2.877        0.714    OK.
>>>> 21         58613.050        0.149        1.089    OK.
>>>> 21          2973.281        2.929        0.698    OK.
>>>> 21         34991.505        0.249        0.702    OK.
>>>> 21          4479.034        1.944        0.696    OK.
>>>> 21         40401.894        0.216        1.310    OK.
>>>> 21         63325.943        0.138        1.124    OK.
>>>> 20         17100.304        0.510        0.620    OK.
>>>> 20          2859.158        3.047        0.832    OK.
>>>> 20          2660.459        3.274        0.820    OK.
>>>> 20          2871.060        3.034        0.821    OK.
>>>> 20        105947.063        0.082        0.728    OK.
>>>> 20          2851.650        3.055        0.827    OK.
>>>> 20          2766.737        3.149        0.837    OK.
>>>> 20         13887.535        0.627        0.813    OK.
>>>> 20          9450.158        0.919        0.854    OK.
>>>> 20          2983.460        2.920        0.838    OK.
>>>> 19             0.000        0.000          -      No DD grid found for these settings.
>>>> 18         62490.241        0.139        1.070    OK.
>>>> 18         75625.947        0.115        0.512    OK.
>>>> 18          3584.509        2.430        1.176    OK.
>>>> 18          4988.745        1.734        1.197    OK.
>>>> 18         92981.804        0.094        0.529    OK.
>>>> 18          3070.496        2.837        1.192    OK.
>>>> 18          3089.339        2.820        1.204    OK.
>>>> 18          5880.675        1.465        1.170    OK.
>>>> 18          3094.133        2.816        1.214    OK.
>>>> 18          3573.552        2.437        1.191    OK.
>>>> 17             0.000        0.000          -      No DD grid found for these settings.
>>>> 16          3105.597        2.805        0.998    OK.
>>>> 16          2719.826        3.203        1.045    OK.
>>>> 16          3124.013        2.788        0.992    OK.
>>>> 16          2708.751        3.216        1.030    OK.
>>>> 16          3116.887        2.795        1.023    OK.
>>>> 16          2695.859        3.232        1.038    OK.
>>>> 16          2710.272        3.215        1.033    OK.
>>>> 16         32639.259        0.267        0.514    OK.
>>>> 16         56748.577        0.153        0.959    OK.
>>>> 16         32362.192        0.269        1.816    OK.
>>>> 15         40410.983        0.216        1.241    OK.
>>>> 15          3727.108        2.337        1.262    OK.
>>>> 15          3297.944        2.642        1.242    OK.
>>>> 15         23012.201        0.379        0.994    OK.
>>>> 15          3328.307        2.618        1.248    OK.
>>>> 15         56869.719        0.153        0.568    OK.
>>>> 15         26662.044        0.327        0.854    OK.
>>>> 15         44026.837        0.198        1.198    OK.
>>>> 15          3754.812        2.320        1.238    OK.
>>>> 15         68683.967        0.127        0.844    OK.
>>>> 14          2934.532        2.969        1.466    OK.
>>>> 14          2824.434        3.085        1.430    OK.
>>>> 14          2778.103        3.137        1.391    OK.
>>>> 14         28435.548        0.306        0.957    OK.
>>>> 14          2876.113        3.030        1.396    OK.
>>>> 14          2803.951        3.108        1.438    OK.
>>>> 14          9538.366        0.913        1.400    OK.
>>>> 14          2887.242        3.018        1.424    OK.
>>>> 14         32542.115        0.268        0.529    OK.
>>>> 14         14256.539        0.609        1.432    OK.
>>>> 13          5010.011        1.732        1.768    OK.
>>>> 13         19270.893        0.452        1.481    OK.
>>>> 13          3451.426        2.525        1.860    OK.
>>>> 13         28566.186        0.305        0.620    OK.
>>>> 13          3481.006        2.504        1.833    OK.
>>>> 13         28457.876        0.306        0.933    OK.
>>>> 13          3689.128        2.362        1.795    OK.
>>>> 13          3451.925        2.525        1.831    OK.
>>>> 13         34918.063        0.249        1.838    OK.
>>>> 13          3473.566        2.509        1.854    OK.
>>>> 12         42705.256        0.204        1.039    OK.
>>>> 12          4934.453        1.763        1.292    OK.
>>>> 12         16759.163        0.520        1.288    OK.
>>>> 12         27660.618        0.315        0.855    OK.
>>>> 12          6293.874        1.380        1.263    OK.
>>>> 12         40502.818        0.215        1.284    OK.
>>>> 12         31595.114        0.276        0.615    OK.
>>>> 12         61936.825        0.140        0.612    OK.
>>>> 12          3013.850        2.891        1.345    OK.
>>>> 12          3840.023        2.269        1.310    OK.
>>>> 0          2628.156        3.317          -      OK.
>>>> 0          2573.649        3.387          -      OK.
>>>> 0         95523.769        0.091          -      OK.
>>>> 0          2594.895        3.360          -      OK.
>>>> 0          2614.131        3.335          -      OK.
>>>> 0          2610.647        3.339          -      OK.
>>>> 0          2560.067        3.405          -      OK.
>>>> 0          2609.485        3.341          -      OK.
>>>> 0          2603.154        3.349          -      OK.
>>>> 0          2583.289        3.375          -      OK.
>>>> -1( 16)     2672.797        3.260        1.002    OK.
>>>> -1( 16)    57769.149        0.151        1.723    OK.
>>>> -1( 16)    48598.334        0.179        1.138    OK.
>>>> -1( 16)     2699.333        3.228        1.040    OK.
>>>> -1( 16)    54243.321        0.161        1.679    OK.
>>>> -1( 16)     2719.854        3.203        1.051    OK.
>>>> -1( 16)     2716.365        3.207        1.051    OK.
>>>> -1( 16)    24278.608        0.359        0.835    OK.
>>>> -1( 16)    19357.359        0.449        1.006    OK.
>>>> -1( 16)    45500.360        0.191        0.795    OK.
>>>>
>>>> Tuning took   500.5 minutes.
>>>>
>>>> ------------------------------------------------------------
>>>> Summary of successful runs:
>>>> Line  tpr  PME nodes  Gcycles Av.   Std.dev.   ns/day   PME/f   DD grid
>>>>   0    0      24       10684.386  10896.612    1.675    0.702   3  4  2
>>>>   1    0      23       13787.137  19462.982    1.678    0.968   1  5  5
>>>>   2    0      22       13554.332  13814.153    1.535    1.042   2 13  1
>>>>   3    0      21       25507.574  24601.033    1.358    0.878   3  3  3
>>>>   4    0      20       16337.758  31934.533    2.062    0.799   2  2  7
>>>>   5    0      18       25837.944  36067.176    1.689    1.045   3  2  5
>>>>   6    0      16       14193.123  19370.807    2.194    1.045   4  4  2
>>>>   7    0      15       27377.392  24308.700    1.132    1.069   3 11  1
>>>>   8    0      14       10187.694  11414.829    2.044    1.286   1  2 17
>>>>   9    0      13       13377.008  12969.168    1.547    1.581   1  5  7
>>>>  10    0      12       23924.199  20299.796    0.997    1.090   3  4  3
>>>>  11    0       0       11890.124  29385.874    3.030      -     6  4  2
>>>>  12    0   -1( 16)     26055.548  23371.735    1.439    1.132   4  4  2
>>>> --

-- 
Best Regards,
Alexey 'Alexxy' Shvetsov
Petersburg Nuclear Physics Institute, Russia
Department of Molecular and Radiation Biophysics
Gentoo Team Ru
Gentoo Linux Dev
mailto:alexxyum at gmail.com
mailto:alexxy at gentoo.org
mailto:alexxy at omrb.pnpi.spb.ru


