[gmx-developers] RE: Gromacs on 48 core magny-cours AMDs
Alexey Shvetsov
alexxy at omrb.pnpi.spb.ru
Sat Sep 17 22:48:44 CEST 2011
Hi all!
Yep. The reason will be almost the same for all new hardware. Old kernels have
this issue: 2.6.18 dates from 2006, and RHEL-based distros ship a patched
version with thousands of patches and driver backports, but those patches
hardly touch anything other than drivers and filesystems, so the kernel simply
knows nothing about the features of new hardware. So the best thing is to use
a recent mainline kernel.
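As a first check, before touching anything, something like this (plain
standard tools, nothing Gromacs-specific) shows what a node is actually
running:

# kernel version the node has booted
uname -r
# distro release string, to see which RHEL/CentOS point release it belongs to
cat /etc/redhat-release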
On Sat, 17 Sep 2011 13:22:42 -0700, Igor Leontyev wrote:
> Thank you, Alexey, for the prompt and informative response. If I
> understand you correctly, the new NUMA hardware of the AMD nodes requires
> a newer Linux kernel. Before we start installing the newer kernel, could
> you comment on why the problem is also observed on Intel nodes? The
> server is a 16-core AMD machine. Could that be the reason for the Intel nodes?
>
> Igor Leontyev
>
>
>
>> Alexey Shvetsov wrote
>
>> Hi all!
>> On Sat, 17 Sep 2011 12:04:41 -0700, Igor Leontyev wrote:
>>> The problem with unstable Gromacs performance still exists, but there
>>> is some progress:
>>> 1) The MPI version is unstable in runs using 48 cores per node, but
>>> STABLE when using fewer than 48 cores/node.
>>> 2) The MPI version runs well even on 180 cores, distributed as 45
>>> per each of 4 nodes.
>>> 3) The threaded version has no problems in 48-core runs.
>> Yes, there will be no problems with the threaded version, since it does
>> not use the IB adapter.
>>>
>>> - the cluster configuration is typical (not NUMA);
>> You are wrong here. Each of your 48-core machines has 4 NUMA nodes, one
>> per CPU.
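>> You can see the layout yourself with something like (assuming the
>> numactl package is installed):
>>
>> numactl --hardware   # lists the NUMA nodes and which cores/memory belong to each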
>>> - software: Gromacs 4.5.4; compiled by gcc4.4.6; CentOS 5.6 kernel
>>> 2.6.18-238.19.1.el5.
>> This kernel is too old to work correctly with such new hardware, so you
>> need a newer kernel, let's say something like >= 2.6.38.
>>> - the compilation used default math libraries and OpenMPI 1.4.3
>>> supporting InfiniBand
>> Did you use CPU binding and hwloc for MPI process binding?
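>> With OpenMPI 1.4.x that would look roughly like the sketch below (the
>> hostfile name is taken from your g_tune_pme output, the .tpr name is a
>> placeholder; check the exact binding options against your ompi_info):
>>
>> mpirun -np 48 --hostfile node_loading.txt \
>>        --mca mpi_paffinity_alone 1 \
>>        mdrun_mpich1.4.3 -s your_system.tpr -deffnm bench
>> # (recent 1.4.x releases should also accept --bind-to-core / --bycore)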
>>>
>>> Any idea why the use of all 48 cr/node results in unstable
>>> performance ?
>> It's actually very simple. I have similar hardware, and with old kernels
>> the problem was the following:
>> 1. They don't have a good scheduler for NUMA systems; kernels >= 2.6.36
>> have a good NUMA-aware scheduler.
>> 2. Interrupts from the IB adapter may flood one of the CPUs, which is why
>> you will have problems with old kernels. In kernels >= 2.6.38 this is
>> solved (a quick way to check for it is sketched below).
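>> For example (plain /proc interfaces, run as root; the driver name mlx4
>> and the IRQ number 98 are only illustrations, take the real ones from
>> your own /proc/interrupts):
>>
>> # see which core(s) the InfiniBand interrupts land on
>> grep -i mlx4 /proc/interrupts
>> # spread a given IRQ (here 98, as an example) over cores 0-3
>> echo f > /proc/irq/98/smp_affinity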
>>>
>>> Igor Leontyev
>>>
>>>> Igor wrote:
>>>> The issue might be related to the configuration of our brand new
>>>> cluster, which I am testing now. On this cluster the unstable behavior
>>>> of Gromacs is also observed on Intel Xeon nodes. For the Gromacs
>>>> installation I repeated all the steps that I have previously done many
>>>> times on an 8-core dual-Xeon workstation without any problems. See
>>>> below the compilation script.
>>>>
>>>> # =====================================================================
>>>> # path where to install
>>>> pth_install=/home/leontyev/programs/bin/gromacs/gromacs-4.5.4
>>>> # program name suffix
>>>> suff="_mpich1.4.3"
>>>> # path of FFTW library (SINGLE PRECISION)
>>>> pth_fft=/home/leontyev/programs/bin/fftw/fftw-3.2.2/single
>>>> # path of 'open_mpi' library
>>>> pth_lam=/home/leontyev/programs/bin/mpi/openmpi/openmpi-1.4.3
>>>> export LD_LIBRARY_PATH="$pth_lam/lib"
>>>> PATH="$pth_lam/bin:$PATH"
>>>>
>>>> export CPPFLAGS="-I$pth_fft/include -I$pth_lam/include"
>>>> export LDFLAGS="-L$pth_fft/lib -L$pth_lam/lib"
>>>>
>>>> make distclean
>>>> # SINGLE PRECISION
>>>> ./configure --without-x --prefix=$pth_install \
>>>>             --program-suffix=$suff --enable-mpi
>>>>
>>>> make -j 12 mdrun >& install.log
>>>> make install-mdrun >> install.log
>>>> # =====================================================================
>>>>
>>>> Igor
>>>>
>>>>
>>>>> Alexey Shvetsov wrote:
>>>>>
>>>>> Hello!
>>>>>
>>>>> Well, there may be several problems:
>>>>> 1. An old kernel that works incorrectly with large NUMA systems
>>>>> 2. No correct process binding to cores
>>>>> 3. The configuration of gcc/math libs
>>>>>
>>>>> What is your MPI version, and which versions of the FFTW and BLAS
>>>>> libs do you use, if external ones?
>>>>> Also please post your CFLAGS.
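>>>>> Something like this collects most of that (assuming the OpenMPI in
>>>>> your PATH is the one Gromacs was built against, and that the
>>>>> gromacs-4.5.4 configure tree is still around):
>>>>>
>>>>> gcc --version | head -1
>>>>> mpirun --version                # OpenMPI reports its version here
>>>>> grep -m1 'CFLAGS' config.log    # run inside the gromacs-4.5.4 build dir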
>>>>>
>>>>> Here we have good performance on such nodes running SLES with a
>>>>> 2.6.32 kernel (with Gentoo Prefix on top of it providing the OpenMPI
>>>>> and OFED stack) and with Gentoo (kernel 3.0.4) with many system
>>>>> optimizations made by me =)
>>>>>
>>>>> All results are stable. Gentoo works better here because it doesn't
>>>>> have the IRQ bug in the kernel, plus some optimizations.
>>>
>>>
>>>> On Sep 1, 2011, at 9:19 AM, Sander Pronk wrote:
>>>>
>>>>>
>>>>> On 31 Aug 2011, at 22:10 , Igor Leontyev wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am benchmarking a 100K atom system (protein ~12K and solvent
>>>>>> ~90K atoms, 1 fs time step, cutoffs 1.2 nm) on a 48-core 2.1 GHz AMD
>>>>>> node. Software: Gromacs 4.5.4; compiled with gcc 4.4.6; CentOS 5.6,
>>>>>> kernel 2.6.18-238.19.1.el5. See the results of g_tune_pme below.
>>>>>> The performance is absolutely unstable; the computation time for
>>>>>> equivalent runs can differ by orders of magnitude.
>>>>>>
>>>>>> The issue seems to be similar to what has been discussed earlier
>>>>>>
>>>>>>
>>>>>> http://lists.gromacs.org/pipermail/gmx-users/2010-October/055113.html
>>>>>> Is there any progress in resolving it?
>>>>>
>>>>> That's an old kernel. If I remember correctly, that thread
>>>>> discussed issues related to thread/process affinity and
>>>>> NUMA-awareness on older kernels.
>>>>>
>>>>> Perhaps you could try a newer kernel?
>>>>
>>>> Hi,
>>>>
>>>> we are running a slightly older kernel and get nice performance on
>>>> our 48-core Magny-Cours.
>>>> Maybe with mpich the processes are not being pinned to the cores
>>>> correctly.
>>>>
>>>> Could you try the threaded version of mdrun? This is what gives the
>>>> best (and reliable) performance in our case.
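>>>> A minimal way to try that (a sketch; -nt sets the thread count in the
>>>> 4.5 series, and the output name is just an example):
>>>>
>>>> mdrun -nt 48 -s cco_PM_ff03_sorin_scaled_meanpol.tpr -deffnm bench_threads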
>>>>
>>>> Carsten
>>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------
>>>>>>
>>>>>> P E R F O R M A N C E R E S U L T S
>>>>>>
>>>>>> ------------------------------------------------------------
>>>>>> g_tune_pme for Gromacs VERSION 4.5.4
>>>>>> Number of nodes : 48
>>>>>> The mpirun command is :
>>>>>> /home/leontyev/programs/bin/mpi/openmpi/openmpi-1.4.3/bin/mpirun
>>>>>> --hostfile node_loading.txt
>>>>>> Passing # of nodes via : -np
>>>>>> The mdrun command is :
>>>>>>
>>>>>> /home/leontyev/programs/bin/gromacs/gromacs-4.5.4/bin/mdrun_mpich1.4.3
>>>>>> mdrun args benchmarks : -resetstep 100 -o bench.trr -x bench.xtc
>>>>>> -cpo bench.cpt -c bench.gro -e bench.edr -g bench.log
>>>>>> Benchmark steps : 1000
>>>>>> dlb equilibration steps : 100
>>>>>> Repeats for each test : 10
>>>>>> Input file : cco_PM_ff03_sorin_scaled_meanpol.tpr
>>>>>> Coulomb type : PME
>>>>>> Grid spacing x y z : 0.114376 0.116700 0.116215
>>>>>> Van der Waals type : Cut-off
>>>>>>
>>>>>> Will try these real/reciprocal workload settings:
>>>>>> No.  scaling  rcoulomb  nkx  nky  nkz  spacing   rvdw      tpr file
>>>>>>  0   -input-  1.200000   72   80  112  0.116700  1.200000  cco_PM_ff03_sorin_scaled_meanpol_bench00.tpr
>>>>>>
>>>>>> Individual timings for input file 0
>>>>>> (cco_PM_ff03_sorin_scaled_meanpol_bench00.tpr):
>>>>>> PME nodes Gcycles ns/day PME/f Remark
>>>>>> 24 3185.840 2.734 0.538 OK.
>>>>>> 24 7237.416 1.203 1.119 OK.
>>>>>> 24 3225.448 2.700 0.546 OK.
>>>>>> 24 5844.942 1.489 1.012 OK.
>>>>>> 24 4013.986 2.169 0.552 OK.
>>>>>> 24 18578.174 0.469 0.842 OK.
>>>>>> 24 3234.702 2.692 0.559 OK.
>>>>>> 24 25818.267 0.337 0.815 OK.
>>>>>> 24 32470.278 0.268 0.479 OK.
>>>>>> 24 3234.806 2.692 0.561 OK.
>>>>>> 23 15097.577 0.577 0.824 OK.
>>>>>> 23 2948.211 2.954 0.705 OK.
>>>>>> 23 15640.485 0.557 0.826 OK.
>>>>>> 23 66961.240 0.130 3.215 OK.
>>>>>> 23 2964.927 2.938 0.698 OK.
>>>>>> 23 2965.896 2.937 0.669 OK.
>>>>>> 23 11205.121 0.774 0.668 OK.
>>>>>> 23 2964.737 2.938 0.672 OK.
>>>>>> 23 13384.753 0.649 0.665 OK.
>>>>>> 23 3738.425 2.329 0.738 OK.
>>>>>> 22 3130.744 2.782 0.682 OK.
>>>>>> 22 3981.770 2.187 0.659 OK.
>>>>>> 22 6397.259 1.350 0.666 OK.
>>>>>> 22 41374.579 0.211 3.509 OK.
>>>>>> 22 3193.327 2.728 0.683 OK.
>>>>>> 22 21405.007 0.407 0.871 OK.
>>>>>> 22 3543.511 2.457 0.686 OK.
>>>>>> 22 3539.981 2.460 0.701 OK.
>>>>>> 22 30946.123 0.281 1.235 OK.
>>>>>> 22 18031.023 0.483 0.729 OK.
>>>>>> 21 2978.520 2.924 0.699 OK.
>>>>>> 21 4487.921 1.940 0.666 OK.
>>>>>> 21 39796.932 0.219 1.085 OK.
>>>>>> 21 3027.659 2.877 0.714 OK.
>>>>>> 21 58613.050 0.149 1.089 OK.
>>>>>> 21 2973.281 2.929 0.698 OK.
>>>>>> 21 34991.505 0.249 0.702 OK.
>>>>>> 21 4479.034 1.944 0.696 OK.
>>>>>> 21 40401.894 0.216 1.310 OK.
>>>>>> 21 63325.943 0.138 1.124 OK.
>>>>>> 20 17100.304 0.510 0.620 OK.
>>>>>> 20 2859.158 3.047 0.832 OK.
>>>>>> 20 2660.459 3.274 0.820 OK.
>>>>>> 20 2871.060 3.034 0.821 OK.
>>>>>> 20 105947.063 0.082 0.728 OK.
>>>>>> 20 2851.650 3.055 0.827 OK.
>>>>>> 20 2766.737 3.149 0.837 OK.
>>>>>> 20 13887.535 0.627 0.813 OK.
>>>>>> 20 9450.158 0.919 0.854 OK.
>>>>>> 20 2983.460 2.920 0.838 OK.
>>>>>> 19 0.000 0.000 - No DD grid found for these settings.
>>>>>> 18 62490.241 0.139 1.070 OK.
>>>>>> 18 75625.947 0.115 0.512 OK.
>>>>>> 18 3584.509 2.430 1.176 OK.
>>>>>> 18 4988.745 1.734 1.197 OK.
>>>>>> 18 92981.804 0.094 0.529 OK.
>>>>>> 18 3070.496 2.837 1.192 OK.
>>>>>> 18 3089.339 2.820 1.204 OK.
>>>>>> 18 5880.675 1.465 1.170 OK.
>>>>>> 18 3094.133 2.816 1.214 OK.
>>>>>> 18 3573.552 2.437 1.191 OK.
>>>>>> 17 0.000 0.000 - No DD grid found for these settings.
>>>>>> 16 3105.597 2.805 0.998 OK.
>>>>>> 16 2719.826 3.203 1.045 OK.
>>>>>> 16 3124.013 2.788 0.992 OK.
>>>>>> 16 2708.751 3.216 1.030 OK.
>>>>>> 16 3116.887 2.795 1.023 OK.
>>>>>> 16 2695.859 3.232 1.038 OK.
>>>>>> 16 2710.272 3.215 1.033 OK.
>>>>>> 16 32639.259 0.267 0.514 OK.
>>>>>> 16 56748.577 0.153 0.959 OK.
>>>>>> 16 32362.192 0.269 1.816 OK.
>>>>>> 15 40410.983 0.216 1.241 OK.
>>>>>> 15 3727.108 2.337 1.262 OK.
>>>>>> 15 3297.944 2.642 1.242 OK.
>>>>>> 15 23012.201 0.379 0.994 OK.
>>>>>> 15 3328.307 2.618 1.248 OK.
>>>>>> 15 56869.719 0.153 0.568 OK.
>>>>>> 15 26662.044 0.327 0.854 OK.
>>>>>> 15 44026.837 0.198 1.198 OK.
>>>>>> 15 3754.812 2.320 1.238 OK.
>>>>>> 15 68683.967 0.127 0.844 OK.
>>>>>> 14 2934.532 2.969 1.466 OK.
>>>>>> 14 2824.434 3.085 1.430 OK.
>>>>>> 14 2778.103 3.137 1.391 OK.
>>>>>> 14 28435.548 0.306 0.957 OK.
>>>>>> 14 2876.113 3.030 1.396 OK.
>>>>>> 14 2803.951 3.108 1.438 OK.
>>>>>> 14 9538.366 0.913 1.400 OK.
>>>>>> 14 2887.242 3.018 1.424 OK.
>>>>>> 14 32542.115 0.268 0.529 OK.
>>>>>> 14 14256.539 0.609 1.432 OK.
>>>>>> 13 5010.011 1.732 1.768 OK.
>>>>>> 13 19270.893 0.452 1.481 OK.
>>>>>> 13 3451.426 2.525 1.860 OK.
>>>>>> 13 28566.186 0.305 0.620 OK.
>>>>>> 13 3481.006 2.504 1.833 OK.
>>>>>> 13 28457.876 0.306 0.933 OK.
>>>>>> 13 3689.128 2.362 1.795 OK.
>>>>>> 13 3451.925 2.525 1.831 OK.
>>>>>> 13 34918.063 0.249 1.838 OK.
>>>>>> 13 3473.566 2.509 1.854 OK.
>>>>>> 12 42705.256 0.204 1.039 OK.
>>>>>> 12 4934.453 1.763 1.292 OK.
>>>>>> 12 16759.163 0.520 1.288 OK.
>>>>>> 12 27660.618 0.315 0.855 OK.
>>>>>> 12 6293.874 1.380 1.263 OK.
>>>>>> 12 40502.818 0.215 1.284 OK.
>>>>>> 12 31595.114 0.276 0.615 OK.
>>>>>> 12 61936.825 0.140 0.612 OK.
>>>>>> 12 3013.850 2.891 1.345 OK.
>>>>>> 12 3840.023 2.269 1.310 OK.
>>>>>> 0 2628.156 3.317 - OK.
>>>>>> 0 2573.649 3.387 - OK.
>>>>>> 0 95523.769 0.091 - OK.
>>>>>> 0 2594.895 3.360 - OK.
>>>>>> 0 2614.131 3.335 - OK.
>>>>>> 0 2610.647 3.339 - OK.
>>>>>> 0 2560.067 3.405 - OK.
>>>>>> 0 2609.485 3.341 - OK.
>>>>>> 0 2603.154 3.349 - OK.
>>>>>> 0 2583.289 3.375 - OK.
>>>>>> -1( 16) 2672.797 3.260 1.002 OK.
>>>>>> -1( 16) 57769.149 0.151 1.723 OK.
>>>>>> -1( 16) 48598.334 0.179 1.138 OK.
>>>>>> -1( 16) 2699.333 3.228 1.040 OK.
>>>>>> -1( 16) 54243.321 0.161 1.679 OK.
>>>>>> -1( 16) 2719.854 3.203 1.051 OK.
>>>>>> -1( 16) 2716.365 3.207 1.051 OK.
>>>>>> -1( 16) 24278.608 0.359 0.835 OK.
>>>>>> -1( 16) 19357.359 0.449 1.006 OK.
>>>>>> -1( 16) 45500.360 0.191 0.795 OK.
>>>>>>
>>>>>> Tuning took 500.5 minutes.
>>>>>>
>>>>>> ------------------------------------------------------------
>>>>>> Summary of successful runs:
>>>>>> Line  tpr  PME nodes  Gcycles Av.  Std.dev.   ns/day  PME/f   DD grid
>>>>>>   0    0       24      10684.386   10896.612   1.675   0.702   3  4  2
>>>>>>   1    0       23      13787.137   19462.982   1.678   0.968   1  5  5
>>>>>>   2    0       22      13554.332   13814.153   1.535   1.042   2 13  1
>>>>>>   3    0       21      25507.574   24601.033   1.358   0.878   3  3  3
>>>>>>   4    0       20      16337.758   31934.533   2.062   0.799   2  2  7
>>>>>>   5    0       18      25837.944   36067.176   1.689   1.045   3  2  5
>>>>>>   6    0       16      14193.123   19370.807   2.194   1.045   4  4  2
>>>>>>   7    0       15      27377.392   24308.700   1.132   1.069   3 11  1
>>>>>>   8    0       14      10187.694   11414.829   2.044   1.286   1  2 17
>>>>>>   9    0       13      13377.008   12969.168   1.547   1.581   1  5  7
>>>>>>  10    0       12      23924.199   20299.796   0.997   1.090   3  4  3
>>>>>>  11    0        0      11890.124   29385.874   3.030   -       6  4  2
>>>>>>  12    0   -1( 16)     26055.548   23371.735   1.439   1.132   4  4  2
--
Best Regards,
Alexey 'Alexxy' Shvetsov
Petersburg Nuclear Physics Institute, Russia
Department of Molecular and Radiation Biophysics
Gentoo Team Ru
Gentoo Linux Dev
mailto:alexxyum at gmail.com
mailto:alexxy at gentoo.org
mailto:alexxy at omrb.pnpi.spb.ru