[gmx-developers] RE: Gromacs on 48 core magny-cours AMDs

Igor Leontyev ileontyev at ucdavis.edu
Sat Sep 17 22:22:42 CEST 2011


Thank you, Alexey, for the prompt and informative response. If I understand you
correctly, the new NUMA hardware of the AMD nodes requires a newer Linux kernel.
Before we start installing the newer kernel, could you comment on why the problem
is also observed on the Intel nodes? The server is a 16-core AMD machine; could
that be the reason for the Intel nodes?

Igor Leontyev



> Alexey Shvetsov wrote

> Hi all!
> On Sat, 17 Sep 2011 12:04:41 -0700, Igor Leontyev wrote:
>> The problem with unstable gromacs performance still exists, but there
>> is some progress:
>> 1) The MPI version is unstable in runs using 48 cores per node, but
>> STABLE when using fewer than 48 cores/node.
>> 2) The MPI version runs well even on 180 cores, distributed as 45
>> per each of 4 nodes.
>> 3) The threaded version has no problems in 48-core runs.
> Yes, there will be no problems with the threaded version, since it doesn't use
> the IB HBA.
>>
>> - the cluster configuration is typical (not NUMA);
> You are wrong here. Each of your 48-core machines has 4 NUMA nodes, one
> per CPU.
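
For reference, a quick way to verify this layout on one of the compute nodes
(a minimal sketch, assuming the numactl and hwloc packages are installed):

# list the NUMA nodes with their CPUs and memory sizes
numactl --hardware

# text overview of sockets, NUMA nodes, caches and cores
lstopo
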
>> - software: Gromacs 4.5.4; compiled with gcc 4.4.6; CentOS 5.6, kernel
>> 2.6.18-238.19.1.el5.
> This kernel is too old to work correctly with such new hardware, so
> you need a newer kernel, say something like >= 2.6.38.
>> - the compilation used default math libraries and OpenMPI 1.4.3
>> supporting InfiniBand
> Did you use CPU binding and hwloc for MPI process binding?
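
As a point of reference, explicit core binding with OpenMPI's mpirun might look
roughly like the sketch below; the exact option names depend on the OpenMPI
build (check mpirun --help), and -deffnm bench is only a placeholder for the
real output names:

# bind one MPI rank per core and print the resulting bindings
mpirun -np 48 --hostfile node_loading.txt \
       --bind-to-core --report-bindings \
       mdrun_mpich1.4.3 -deffnm bench
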
>>
>> Any idea why using all 48 cores/node results in unstable
>> performance?
> It's actually very simple. I have similar hardware, and with old kernels
> the problems were the following:
> 1. They don't have a good scheduler for NUMA systems; kernels >= 2.6.36
> have a good NUMA-aware scheduler.
> 2. Interrupts from the IB HBA may flood one of the CPUs, which is why you
> will have problems with old kernels. In kernels >= 2.6.38 this is solved.
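
To see whether the InfiniBand adapter's interrupts really pile up on a single
CPU, something like the following can be run on a node during a job (a sketch;
the interrupt names to grep for depend on the IB driver, e.g. mlx4, and <irq>
is a placeholder for the number shown in /proc/interrupts):

# watch the per-CPU interrupt counters of the IB adapter
watch -n 1 'grep -i -e mlx -e ib_ /proc/interrupts'

# show which CPUs a given interrupt is allowed to hit
cat /proc/irq/<irq>/smp_affinity
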
>>
>> Igor Leontyev
>>
>>> Igor wrote:
>>> The issue might be related to the configuration of our brand new cluster,
>>> which I am testing now. On this cluster the unstable behavior of gromacs
>>> is also observed on Intel Xeon nodes. For the gromacs installation I
>>> repeated all the steps that I have previously done many times on an
>>> 8-core dual-Xeon workstation without any problems. See below the
>>> compilation script.
>>>
>>> #
>>> =====================================================================
>>> #
>>> # path where to install
>>> pth_install=/home/leontyev/programs/bin/gromacs/gromacs-4.5.4
>>> # program name suffix
>>> suff="_mpich1.4.3"
>>> # path of FFTW library
>>> # SINGLE PRECISION
>>> pth_fft=/home/leontyev/programs/bin/fftw/fftw-3.2.2/single
>>> # path of 'open_mpi' library
>>> pth_lam=/home/leontyev/programs/bin/mpi/openmpi/openmpi-1.4.3
>>> export LD_LIBRARY_PATH="$pth_lam/lib"
>>>
>>> PATH="$pth_lam/bin:$PATH"
>>>
>>> export CPPFLAGS="-I/$pth_fft/include -I/$pth_lam/include"
>>> export LDFLAGS="-L/$pth_fft/lib -L/$pth_lam/lib"
>>>
>>> make distclean
>>> # SINGLE PRECISION
>>> ./configure --without-x --prefix=/$pth_install \
>>>             --program-suffix=$suff --enable-mpi
>>>
>>> make -j 12 mdrun >& install.log
>>> make install-mdrun  >> install.log
>>> #
>>> =====================================================================
>>>
>>> Igor
>>>
>>>
>>>> Alexey Shvetsov wrote:
>>>>
>>>> Hello!
>>>>
>>>> Well, there may be several problems:
>>>> 1. An old kernel that works incorrectly with large NUMA systems
>>>> 2. No correct process binding to cores
>>>> 3. Configuration of gcc/math libs
>>>>
>>>> What are your MPI version and the versions of the FFTW and BLAS libs, if
>>>> you use external ones?
>>>> Also please post your CFLAGS.
>>>>
>>>> Here we have good performance on such nodes running SLES with a
>>>> 2.6.32 kernel (with gentoo-prefix on top of it with an openmpi and
>>>> ofed stack) and with Gentoo (kernel 3.0.4) with many system
>>>> optimizations made by me =)
>>>>
>>>> All results are stable. Gentoo works better here because it doesn't
>>>> have the IRQ bug in the kernel, plus some optimizations.
>>
>>
>>> On Sep 1, 2011, at 9:19 AM, Sander Pronk wrote:
>>>
>>>>
>>>> On 31 Aug 2011, at 22:10 , Igor Leontyev wrote:
>>>>
>>>>> Hi
>>>>> I am benchmarking a 100K-atom system (protein ~12K and solvent
>>>>> ~90K atoms, 1 fs time step, cutoffs 1.2 nm) on a 48-core 2.1 GHz AMD
>>>>> node. Software: Gromacs 4.5.4; compiled with gcc 4.4.6; CentOS 5.6,
>>>>> kernel 2.6.18-238.19.1.el5. See the results of g_tune_pme below.
>>>>> The performance is absolutely unstable; the computation time for
>>>>> equivalent runs can differ by orders of magnitude.
>>>>>
>>>>> The issue seems to be similar to what has been discussed earlier
>>>>>
>>>>> http://lists.gromacs.org/pipermail/gmx-users/2010-October/055113.html
>>>>> Is there any progress in resolving it?
>>>>
>>>> That's an old kernel. If I remember correctly, that thread
>>>> discussed issues related to thread and process affinity and
>>>> NUMA-awareness on older kernels.
>>>>
>>>> Perhaps you could try a newer kernel?
>>>
>>> Hi,
>>>
>>> we are running a slightly older kernel and get nice performance on
>>> our 48-core magny-cours.
>>> Maybe with mpich the processes are not being pinned to the cores
>>> correctly.
>>>
>>> Could you try the threaded version of mdrun? This is what gives the
>>> best (and most reliable) performance in our case.
>>>
>>> Carsten
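
Following Carsten's suggestion, a single-node run with the thread-parallel
(non-MPI) mdrun build would look roughly like the sketch below; the binary
name and -deffnm value are placeholders and depend on how the non-MPI build
is installed:

# thread-parallel mdrun: 48 threads on one node, no mpirun needed
mdrun -nt 48 -s cco_PM_ff03_sorin_scaled_meanpol.tpr -deffnm bench
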
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Igor
>>>>>
>>>>>
>>>>> ------------------------------------------------------------
>>>>>
>>>>>    P E R F O R M A N C E   R E S U L T S
>>>>>
>>>>> ------------------------------------------------------------
>>>>> g_tune_pme for Gromacs VERSION 4.5.4
>>>>> Number of nodes         : 48
>>>>> The mpirun command is   :
>>>>> /home/leontyev/programs/bin/mpi/openmpi/openmpi-1.4.3/bin/mpirun
>>>>> --hostfile node_loading.txt
>>>>> Passing # of nodes via  : -np
>>>>> The mdrun  command is   :
>>>>> /home/leontyev/programs/bin/gromacs/gromacs-4.5.4/bin/mdrun_mpich1.4.3
>>>>> mdrun args benchmarks   : -resetstep 100 -o bench.trr -x bench.xtc
>>>>> -cpo bench.cpt -c bench.gro -e bench.edr -g bench.log
>>>>> Benchmark steps         : 1000
>>>>> dlb equilibration steps : 100
>>>>> Repeats for each test   : 10
>>>>> Input file              : cco_PM_ff03_sorin_scaled_meanpol.tpr
>>>>> Coulomb type         : PME
>>>>> Grid spacing x y z   : 0.114376 0.116700 0.116215
>>>>> Van der Waals type   : Cut-off
>>>>>
>>>>> Will try these real/reciprocal workload settings:
>>>>> No.  scaling  rcoulomb  nkx  nky  nkz   spacing      rvdw  tpr file
>>>>>   0  -input-  1.200000   72   80  112  0.116700  1.200000  cco_PM_ff03_sorin_scaled_meanpol_bench00.tpr
>>>>>
>>>>> Individual timings for input file 0
>>>>> (cco_PM_ff03_sorin_scaled_meanpol_bench00.tpr):
>>>>> PME nodes      Gcycles       ns/day        PME/f    Remark
>>>>> 24          3185.840        2.734        0.538    OK.
>>>>> 24          7237.416        1.203        1.119    OK.
>>>>> 24          3225.448        2.700        0.546    OK.
>>>>> 24          5844.942        1.489        1.012    OK.
>>>>> 24          4013.986        2.169        0.552    OK.
>>>>> 24         18578.174        0.469        0.842    OK.
>>>>> 24          3234.702        2.692        0.559    OK.
>>>>> 24         25818.267        0.337        0.815    OK.
>>>>> 24         32470.278        0.268        0.479    OK.
>>>>> 24          3234.806        2.692        0.561    OK.
>>>>> 23         15097.577        0.577        0.824    OK.
>>>>> 23          2948.211        2.954        0.705    OK.
>>>>> 23         15640.485        0.557        0.826    OK.
>>>>> 23         66961.240        0.130        3.215    OK.
>>>>> 23          2964.927        2.938        0.698    OK.
>>>>> 23          2965.896        2.937        0.669    OK.
>>>>> 23         11205.121        0.774        0.668    OK.
>>>>> 23          2964.737        2.938        0.672    OK.
>>>>> 23         13384.753        0.649        0.665    OK.
>>>>> 23          3738.425        2.329        0.738    OK.
>>>>> 22          3130.744        2.782        0.682    OK.
>>>>> 22          3981.770        2.187        0.659    OK.
>>>>> 22          6397.259        1.350        0.666    OK.
>>>>> 22         41374.579        0.211        3.509    OK.
>>>>> 22          3193.327        2.728        0.683    OK.
>>>>> 22         21405.007        0.407        0.871    OK.
>>>>> 22          3543.511        2.457        0.686    OK.
>>>>> 22          3539.981        2.460        0.701    OK.
>>>>> 22         30946.123        0.281        1.235    OK.
>>>>> 22         18031.023        0.483        0.729    OK.
>>>>> 21          2978.520        2.924        0.699    OK.
>>>>> 21          4487.921        1.940        0.666    OK.
>>>>> 21         39796.932        0.219        1.085    OK.
>>>>> 21          3027.659        2.877        0.714    OK.
>>>>> 21         58613.050        0.149        1.089    OK.
>>>>> 21          2973.281        2.929        0.698    OK.
>>>>> 21         34991.505        0.249        0.702    OK.
>>>>> 21          4479.034        1.944        0.696    OK.
>>>>> 21         40401.894        0.216        1.310    OK.
>>>>> 21         63325.943        0.138        1.124    OK.
>>>>> 20         17100.304        0.510        0.620    OK.
>>>>> 20          2859.158        3.047        0.832    OK.
>>>>> 20          2660.459        3.274        0.820    OK.
>>>>> 20          2871.060        3.034        0.821    OK.
>>>>> 20        105947.063        0.082        0.728    OK.
>>>>> 20          2851.650        3.055        0.827    OK.
>>>>> 20          2766.737        3.149        0.837    OK.
>>>>> 20         13887.535        0.627        0.813    OK.
>>>>> 20          9450.158        0.919        0.854    OK.
>>>>> 20          2983.460        2.920        0.838    OK.
>>>>> 19             0.000        0.000          -      No DD grid found
>>>>> for these settings.
>>>>> 18         62490.241        0.139        1.070    OK.
>>>>> 18         75625.947        0.115        0.512    OK.
>>>>> 18          3584.509        2.430        1.176    OK.
>>>>> 18          4988.745        1.734        1.197    OK.
>>>>> 18         92981.804        0.094        0.529    OK.
>>>>> 18          3070.496        2.837        1.192    OK.
>>>>> 18          3089.339        2.820        1.204    OK.
>>>>> 18          5880.675        1.465        1.170    OK.
>>>>> 18          3094.133        2.816        1.214    OK.
>>>>> 18          3573.552        2.437        1.191    OK.
>>>>> 17             0.000        0.000          -      No DD grid found
>>>>> for these settings.
>>>>> 16          3105.597        2.805        0.998    OK.
>>>>> 16          2719.826        3.203        1.045    OK.
>>>>> 16          3124.013        2.788        0.992    OK.
>>>>> 16          2708.751        3.216        1.030    OK.
>>>>> 16          3116.887        2.795        1.023    OK.
>>>>> 16          2695.859        3.232        1.038    OK.
>>>>> 16          2710.272        3.215        1.033    OK.
>>>>> 16         32639.259        0.267        0.514    OK.
>>>>> 16         56748.577        0.153        0.959    OK.
>>>>> 16         32362.192        0.269        1.816    OK.
>>>>> 15         40410.983        0.216        1.241    OK.
>>>>> 15          3727.108        2.337        1.262    OK.
>>>>> 15          3297.944        2.642        1.242    OK.
>>>>> 15         23012.201        0.379        0.994    OK.
>>>>> 15          3328.307        2.618        1.248    OK.
>>>>> 15         56869.719        0.153        0.568    OK.
>>>>> 15         26662.044        0.327        0.854    OK.
>>>>> 15         44026.837        0.198        1.198    OK.
>>>>> 15          3754.812        2.320        1.238    OK.
>>>>> 15         68683.967        0.127        0.844    OK.
>>>>> 14          2934.532        2.969        1.466    OK.
>>>>> 14          2824.434        3.085        1.430    OK.
>>>>> 14          2778.103        3.137        1.391    OK.
>>>>> 14         28435.548        0.306        0.957    OK.
>>>>> 14          2876.113        3.030        1.396    OK.
>>>>> 14          2803.951        3.108        1.438    OK.
>>>>> 14          9538.366        0.913        1.400    OK.
>>>>> 14          2887.242        3.018        1.424    OK.
>>>>> 14         32542.115        0.268        0.529    OK.
>>>>> 14         14256.539        0.609        1.432    OK.
>>>>> 13          5010.011        1.732        1.768    OK.
>>>>> 13         19270.893        0.452        1.481    OK.
>>>>> 13          3451.426        2.525        1.860    OK.
>>>>> 13         28566.186        0.305        0.620    OK.
>>>>> 13          3481.006        2.504        1.833    OK.
>>>>> 13         28457.876        0.306        0.933    OK.
>>>>> 13          3689.128        2.362        1.795    OK.
>>>>> 13          3451.925        2.525        1.831    OK.
>>>>> 13         34918.063        0.249        1.838    OK.
>>>>> 13          3473.566        2.509        1.854    OK.
>>>>> 12         42705.256        0.204        1.039    OK.
>>>>> 12          4934.453        1.763        1.292    OK.
>>>>> 12         16759.163        0.520        1.288    OK.
>>>>> 12         27660.618        0.315        0.855    OK.
>>>>> 12          6293.874        1.380        1.263    OK.
>>>>> 12         40502.818        0.215        1.284    OK.
>>>>> 12         31595.114        0.276        0.615    OK.
>>>>> 12         61936.825        0.140        0.612    OK.
>>>>> 12          3013.850        2.891        1.345    OK.
>>>>> 12          3840.023        2.269        1.310    OK.
>>>>> 0          2628.156        3.317          -      OK.
>>>>> 0          2573.649        3.387          -      OK.
>>>>> 0         95523.769        0.091          -      OK.
>>>>> 0          2594.895        3.360          -      OK.
>>>>> 0          2614.131        3.335          -      OK.
>>>>> 0          2610.647        3.339          -      OK.
>>>>> 0          2560.067        3.405          -      OK.
>>>>> 0          2609.485        3.341          -      OK.
>>>>> 0          2603.154        3.349          -      OK.
>>>>> 0          2583.289        3.375          -      OK.
>>>>> -1( 16)     2672.797        3.260        1.002    OK.
>>>>> -1( 16)    57769.149        0.151        1.723    OK.
>>>>> -1( 16)    48598.334        0.179        1.138    OK.
>>>>> -1( 16)     2699.333        3.228        1.040    OK.
>>>>> -1( 16)    54243.321        0.161        1.679    OK.
>>>>> -1( 16)     2719.854        3.203        1.051    OK.
>>>>> -1( 16)     2716.365        3.207        1.051    OK.
>>>>> -1( 16)    24278.608        0.359        0.835    OK.
>>>>> -1( 16)    19357.359        0.449        1.006    OK.
>>>>> -1( 16)    45500.360        0.191        0.795    OK.
>>>>>
>>>>> Tuning took   500.5 minutes.
>>>>>
>>>>> ------------------------------------------------------------
>>>>> Summary of successful runs:
>>>>> Line  tpr  PME nodes  Gcycles Av.   Std.dev.   ns/day   PME/f   DD grid
>>>>>    0    0      24       10684.386  10896.612    1.675   0.702   3  4  2
>>>>>    1    0      23       13787.137  19462.982    1.678   0.968   1  5  5
>>>>>    2    0      22       13554.332  13814.153    1.535   1.042   2 13  1
>>>>>    3    0      21       25507.574  24601.033    1.358   0.878   3  3  3
>>>>>    4    0      20       16337.758  31934.533    2.062   0.799   2  2  7
>>>>>    5    0      18       25837.944  36067.176    1.689   1.045   3  2  5
>>>>>    6    0      16       14193.123  19370.807    2.194   1.045   4  4  2
>>>>>    7    0      15       27377.392  24308.700    1.132   1.069   3 11  1
>>>>>    8    0      14       10187.694  11414.829    2.044   1.286   1  2 17
>>>>>    9    0      13       13377.008  12969.168    1.547   1.581   1  5  7
>>>>>   10    0      12       23924.199  20299.796    0.997   1.090   3  4  3
>>>>>   11    0       0       11890.124  29385.874    3.030     -     6  4  2
>>>>>   12    0   -1( 16)     26055.548  23371.735    1.439   1.132   4  4  2



