[gmx-developers] RE: Gromacs on 48 core magny-cours AMDs
Alexey Shvetsov
alexxy at omrb.pnpi.spb.ru
Sat Sep 17 21:44:23 CEST 2011
Hi all!
On Sat, 17 Sep 2011 12:04:41 -0700, Igor Leontyev wrote:
> The problem with unstable gromacs performance still exists, but there
> is some progress:
> 1) The MPI version is unstable in runs using 48 cores per node, but
> STABLE when using fewer than 48 cores/node.
> 2) The MPI version runs well even on 180 cores distributed as 45 per
> each of 4 nodes.
> 3) The threaded version has no problems in 48-core runs.
Yes, there will be no problems with the threaded version, since it
doesn't use the IB HBA.
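For a single-node run the threaded binary can be started with something
like this (a sketch; -nt sets the thread count in mdrun 4.5.x, and the
output name is only an example):

  mdrun -nt 48 -s cco_PM_ff03_sorin_scaled_meanpol.tpr -deffnm bench_threads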
>
> - the cluster configuration is typical (not NUMA);
You are wrong here. Each of your 48-core machines has 4 NUMA nodes, one
per CPU.
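You can check this yourself on a compute node, for example (assuming
numactl is installed):

  numactl --hardware   # lists the NUMA nodes and which cores/memory belong to each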
> - software: Gromacs 4.5.4; compiled with gcc 4.4.6; CentOS 5.6, kernel
> 2.6.18-238.19.1.el5.
This kernel is too old to work correctly with such new hardware, so you
need a newer kernel, say something like >=2.6.38.
> - the compilation used default math libraries and OpenMPI 1.4.3
> supporting InfiniBand
Did you use CPU bindings and hwloc (hardware locality) for MPI process
binding?
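With OpenMPI 1.4.x something like the following should give you one rank
pinned per core (a sketch only; check the mpirun(1) man page of your
build. The hostfile and binary names are just the ones from your
g_tune_pme output):

  mpirun -np 48 --hostfile node_loading.txt \
         --bind-to-core --bycore --report-bindings \
         mdrun_mpich1.4.3 -s cco_PM_ff03_sorin_scaled_meanpol.tpr -deffnm bench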
>
> Any idea why using all 48 cores/node results in unstable
> performance?
It's actually very simple. I have similar hardware, and with old kernels
the problems were the following:
1. They don't have a good scheduler for NUMA systems; kernels >=2.6.36
have a good NUMA-aware scheduler.
2. Interrupts from the IB HBA may flood one of the CPUs, which is why
you have problems with old kernels. In kernels >=2.6.38 this is solved.
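You can see whether this is happening by watching the interrupt counters
during a run (the grep pattern and the IRQ number below are only
examples; adjust them to your IB driver):

  grep -i mlx /proc/interrupts          # do all IB interrupts land on one CPU column?
  # on an old kernel you can spread them manually, e.g. for IRQ 90 -> CPU1:
  # echo 2 > /proc/irq/90/smp_affinity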
>
> Igor Leontyev
>
>> Igor wrote:
>> The issue might be related to the configuration of our brand new
>> cluster, which I am testing now. On this cluster the unstable behavior
>> of gromacs is also observed on Intel Xeon nodes. For the gromacs
>> installation I repeated all the steps that I have previously done many
>> times on an 8-core dual-Xeon workstation without problems. See the
>> compilation script below.
>>
>> # =====================================================================
>> #
>> # path where to install
>> pth_install=/home/leontyev/programs/bin/gromacs/gromacs-4.5.4
>> # program name suffix
>> suff="_mpich1.4.3"
>> # path of FFTW library
>> # SINGLE PRECISION
>> pth_fft=/home/leontyev/programs/bin/fftw/fftw-3.2.2/single
>> # path of 'open_mpi' library
>> pth_lam=/home/leontyev/programs/bin/mpi/openmpi/openmpi-1.4.3
>> export LD_LIBRARY_PATH="$pth_lam/lib"
>>
>> PATH="$pth_lam/bin:$PATH"
>>
>> export CPPFLAGS="-I/$pth_fft/include -I/$pth_lam/include"
>> export LDFLAGS="-L/$pth_fft/lib -L/$pth_lam/lib"
>>
>> make distclean
>> # SINGLE PRECISION
>> ./configure --without-x --prefix=/$pth_install \
>>             --program-suffix=$suff --enable-mpi
>>
>> make -j 12 mdrun >& install.log
>> make install-mdrun >> install.log
>> # =====================================================================
>>
>> Igor
>>
>>
>>> Alexey Shvetsov wrote:
>>>
>>> Hello!
>>>
>>> Well, there may be several problems:
>>> 1. An old kernel that works incorrectly with large NUMA systems
>>> 2. No correct process binding to cores
>>> 3. Configuration of the gcc/math libs
>>>
>>> What is your MPI version, and which versions of the FFTW and BLAS
>>> libs do you use, if they are external ones?
>>> Also please post your CFLAGS.
>>>
>>> Here we have good performance on such nodes running SLES with a
>>> 2.6.32 kernel (with Gentoo Prefix on top of it, with OpenMPI and the
>>> OFED stack) and with Gentoo (kernel 3.0.4) with many system
>>> optimizations made by me =)
>>>
>>> All results are stable. Gentoo works better here because it doesn't
>>> have the IRQ bug in the kernel, plus some optimizations.
>
>
>> On Sep 1, 2011, at 9:19 AM, Sander Pronk wrote:
>>
>>>
>>> On 31 Aug 2011, at 22:10 , Igor Leontyev wrote:
>>>
>>>> Hi,
>>>> I am benchmarking a 100K atom system (protein ~12K and solvent
>>>> ~90K atoms, 1 fs time step, cutoffs 1.2 nm) on a 48-core 2.1 GHz AMD
>>>> node. Software: Gromacs 4.5.4; compiled with gcc 4.4.6; CentOS 5.6,
>>>> kernel 2.6.18-238.19.1.el5. See the results of g_tune_pme below.
>>>> The performance is absolutely unstable; the computation time for
>>>> equivalent runs can differ by orders of magnitude.
>>>>
>>>> The issue seems to be similar to what has been discussed earlier
>>>>
>>>> http://lists.gromacs.org/pipermail/gmx-users/2010-October/055113.html
>>>> Is there any progress in resolving it?
>>>
>>> That's an old kernel. If I remember correctly, that thread
>>> discussed issues related to thread and process affinity and
>>> NUMA-awareness on older kernels.
>>>
>>> Perhaps you could try a newer kernel?
>>
>> Hi,
>>
>> we are running a slightly older kernel and get nice performance on
>> our 48-core Magny-Cours.
>> Maybe with mpich the processes are not being pinned to the cores
>> correctly.
>>
>> Could you try the threaded version of mdrun? This is what gives the
>> best (and reliable) performance in our case.
>>
>> Carsten
>>
>>
>>>
>>>
>>>>
>>>> Igor
>>>>
>>>>
>>>> ------------------------------------------------------------
>>>>
>>>> P E R F O R M A N C E R E S U L T S
>>>>
>>>> ------------------------------------------------------------
>>>> g_tune_pme for Gromacs VERSION 4.5.4
>>>> Number of nodes : 48
>>>> The mpirun command is :
>>>> /home/leontyev/programs/bin/mpi/openmpi/openmpi-1.4.3/bin/mpirun
>>>> --hostfile node_loading.txt
>>>> Passing # of nodes via : -np
>>>> The mdrun command is :
>>>> /home/leontyev/programs/bin/gromacs/gromacs-4.5.4/bin/mdrun_mpich1.4.3
>>>> mdrun args benchmarks : -resetstep 100 -o bench.trr -x bench.xtc
>>>> -cpo bench.cpt -c bench.gro -e bench.edr -g bench.log
>>>> Benchmark steps : 1000
>>>> dlb equilibration steps : 100
>>>> Repeats for each test : 10
>>>> Input file : cco_PM_ff03_sorin_scaled_meanpol.tpr
>>>> Coulomb type : PME
>>>> Grid spacing x y z : 0.114376 0.116700 0.116215
>>>> Van der Waals type : Cut-off
>>>>
>>>> Will try these real/reciprocal workload settings:
>>>> No.  scaling  rcoulomb  nkx  nky  nkz  spacing   rvdw      tpr file
>>>>   0  -input-  1.200000   72   80  112  0.116700  1.200000  cco_PM_ff03_sorin_scaled_meanpol_bench00.tpr
>>>>
>>>> Individual timings for input file 0
>>>> (cco_PM_ff03_sorin_scaled_meanpol_bench00.tpr):
>>>> PME nodes Gcycles ns/day PME/f Remark
>>>> 24 3185.840 2.734 0.538 OK.
>>>> 24 7237.416 1.203 1.119 OK.
>>>> 24 3225.448 2.700 0.546 OK.
>>>> 24 5844.942 1.489 1.012 OK.
>>>> 24 4013.986 2.169 0.552 OK.
>>>> 24 18578.174 0.469 0.842 OK.
>>>> 24 3234.702 2.692 0.559 OK.
>>>> 24 25818.267 0.337 0.815 OK.
>>>> 24 32470.278 0.268 0.479 OK.
>>>> 24 3234.806 2.692 0.561 OK.
>>>> 23 15097.577 0.577 0.824 OK.
>>>> 23 2948.211 2.954 0.705 OK.
>>>> 23 15640.485 0.557 0.826 OK.
>>>> 23 66961.240 0.130 3.215 OK.
>>>> 23 2964.927 2.938 0.698 OK.
>>>> 23 2965.896 2.937 0.669 OK.
>>>> 23 11205.121 0.774 0.668 OK.
>>>> 23 2964.737 2.938 0.672 OK.
>>>> 23 13384.753 0.649 0.665 OK.
>>>> 23 3738.425 2.329 0.738 OK.
>>>> 22 3130.744 2.782 0.682 OK.
>>>> 22 3981.770 2.187 0.659 OK.
>>>> 22 6397.259 1.350 0.666 OK.
>>>> 22 41374.579 0.211 3.509 OK.
>>>> 22 3193.327 2.728 0.683 OK.
>>>> 22 21405.007 0.407 0.871 OK.
>>>> 22 3543.511 2.457 0.686 OK.
>>>> 22 3539.981 2.460 0.701 OK.
>>>> 22 30946.123 0.281 1.235 OK.
>>>> 22 18031.023 0.483 0.729 OK.
>>>> 21 2978.520 2.924 0.699 OK.
>>>> 21 4487.921 1.940 0.666 OK.
>>>> 21 39796.932 0.219 1.085 OK.
>>>> 21 3027.659 2.877 0.714 OK.
>>>> 21 58613.050 0.149 1.089 OK.
>>>> 21 2973.281 2.929 0.698 OK.
>>>> 21 34991.505 0.249 0.702 OK.
>>>> 21 4479.034 1.944 0.696 OK.
>>>> 21 40401.894 0.216 1.310 OK.
>>>> 21 63325.943 0.138 1.124 OK.
>>>> 20 17100.304 0.510 0.620 OK.
>>>> 20 2859.158 3.047 0.832 OK.
>>>> 20 2660.459 3.274 0.820 OK.
>>>> 20 2871.060 3.034 0.821 OK.
>>>> 20 105947.063 0.082 0.728 OK.
>>>> 20 2851.650 3.055 0.827 OK.
>>>> 20 2766.737 3.149 0.837 OK.
>>>> 20 13887.535 0.627 0.813 OK.
>>>> 20 9450.158 0.919 0.854 OK.
>>>> 20 2983.460 2.920 0.838 OK.
>>>> 19 0.000 0.000 - No DD grid found for these settings.
>>>> 18 62490.241 0.139 1.070 OK.
>>>> 18 75625.947 0.115 0.512 OK.
>>>> 18 3584.509 2.430 1.176 OK.
>>>> 18 4988.745 1.734 1.197 OK.
>>>> 18 92981.804 0.094 0.529 OK.
>>>> 18 3070.496 2.837 1.192 OK.
>>>> 18 3089.339 2.820 1.204 OK.
>>>> 18 5880.675 1.465 1.170 OK.
>>>> 18 3094.133 2.816 1.214 OK.
>>>> 18 3573.552 2.437 1.191 OK.
>>>> 17 0.000 0.000 - No DD grid found for these settings.
>>>> 16 3105.597 2.805 0.998 OK.
>>>> 16 2719.826 3.203 1.045 OK.
>>>> 16 3124.013 2.788 0.992 OK.
>>>> 16 2708.751 3.216 1.030 OK.
>>>> 16 3116.887 2.795 1.023 OK.
>>>> 16 2695.859 3.232 1.038 OK.
>>>> 16 2710.272 3.215 1.033 OK.
>>>> 16 32639.259 0.267 0.514 OK.
>>>> 16 56748.577 0.153 0.959 OK.
>>>> 16 32362.192 0.269 1.816 OK.
>>>> 15 40410.983 0.216 1.241 OK.
>>>> 15 3727.108 2.337 1.262 OK.
>>>> 15 3297.944 2.642 1.242 OK.
>>>> 15 23012.201 0.379 0.994 OK.
>>>> 15 3328.307 2.618 1.248 OK.
>>>> 15 56869.719 0.153 0.568 OK.
>>>> 15 26662.044 0.327 0.854 OK.
>>>> 15 44026.837 0.198 1.198 OK.
>>>> 15 3754.812 2.320 1.238 OK.
>>>> 15 68683.967 0.127 0.844 OK.
>>>> 14 2934.532 2.969 1.466 OK.
>>>> 14 2824.434 3.085 1.430 OK.
>>>> 14 2778.103 3.137 1.391 OK.
>>>> 14 28435.548 0.306 0.957 OK.
>>>> 14 2876.113 3.030 1.396 OK.
>>>> 14 2803.951 3.108 1.438 OK.
>>>> 14 9538.366 0.913 1.400 OK.
>>>> 14 2887.242 3.018 1.424 OK.
>>>> 14 32542.115 0.268 0.529 OK.
>>>> 14 14256.539 0.609 1.432 OK.
>>>> 13 5010.011 1.732 1.768 OK.
>>>> 13 19270.893 0.452 1.481 OK.
>>>> 13 3451.426 2.525 1.860 OK.
>>>> 13 28566.186 0.305 0.620 OK.
>>>> 13 3481.006 2.504 1.833 OK.
>>>> 13 28457.876 0.306 0.933 OK.
>>>> 13 3689.128 2.362 1.795 OK.
>>>> 13 3451.925 2.525 1.831 OK.
>>>> 13 34918.063 0.249 1.838 OK.
>>>> 13 3473.566 2.509 1.854 OK.
>>>> 12 42705.256 0.204 1.039 OK.
>>>> 12 4934.453 1.763 1.292 OK.
>>>> 12 16759.163 0.520 1.288 OK.
>>>> 12 27660.618 0.315 0.855 OK.
>>>> 12 6293.874 1.380 1.263 OK.
>>>> 12 40502.818 0.215 1.284 OK.
>>>> 12 31595.114 0.276 0.615 OK.
>>>> 12 61936.825 0.140 0.612 OK.
>>>> 12 3013.850 2.891 1.345 OK.
>>>> 12 3840.023 2.269 1.310 OK.
>>>> 0 2628.156 3.317 - OK.
>>>> 0 2573.649 3.387 - OK.
>>>> 0 95523.769 0.091 - OK.
>>>> 0 2594.895 3.360 - OK.
>>>> 0 2614.131 3.335 - OK.
>>>> 0 2610.647 3.339 - OK.
>>>> 0 2560.067 3.405 - OK.
>>>> 0 2609.485 3.341 - OK.
>>>> 0 2603.154 3.349 - OK.
>>>> 0 2583.289 3.375 - OK.
>>>> -1( 16) 2672.797 3.260 1.002 OK.
>>>> -1( 16) 57769.149 0.151 1.723 OK.
>>>> -1( 16) 48598.334 0.179 1.138 OK.
>>>> -1( 16) 2699.333 3.228 1.040 OK.
>>>> -1( 16) 54243.321 0.161 1.679 OK.
>>>> -1( 16) 2719.854 3.203 1.051 OK.
>>>> -1( 16) 2716.365 3.207 1.051 OK.
>>>> -1( 16) 24278.608 0.359 0.835 OK.
>>>> -1( 16) 19357.359 0.449 1.006 OK.
>>>> -1( 16) 45500.360 0.191 0.795 OK.
>>>>
>>>> Tuning took 500.5 minutes.
>>>>
>>>> ------------------------------------------------------------
>>>> Summary of successful runs:
>>>> Line  tpr  PME nodes  Gcycles Av.  Std.dev.   ns/day  PME/f  DD grid
>>>>    0    0         24    10684.386  10896.612   1.675  0.702  3 4 2
>>>>    1    0         23    13787.137  19462.982   1.678  0.968  1 5 5
>>>>    2    0         22    13554.332  13814.153   1.535  1.042  2 13 1
>>>>    3    0         21    25507.574  24601.033   1.358  0.878  3 3 3
>>>>    4    0         20    16337.758  31934.533   2.062  0.799  2 2 7
>>>>    5    0         18    25837.944  36067.176   1.689  1.045  3 2 5
>>>>    6    0         16    14193.123  19370.807   2.194  1.045  4 4 2
>>>>    7    0         15    27377.392  24308.700   1.132  1.069  3 11 1
>>>>    8    0         14    10187.694  11414.829   2.044  1.286  1 2 17
>>>>    9    0         13    13377.008  12969.168   1.547  1.581  1 5 7
>>>>   10    0         12    23924.199  20299.796   0.997  1.090  3 4 3
>>>>   11    0          0    11890.124  29385.874   3.030      -  6 4 2
>>>>   12    0    -1( 16)    26055.548  23371.735   1.439  1.132  4 4 2
>>>> --
--
Best Regards,
Alexey 'Alexxy' Shvetsov
Petersburg Nuclear Physics Institute, Russia
Department of Molecular and Radiation Biophysics
Gentoo Team Ru
Gentoo Linux Dev
mailto:alexxyum at gmail.com
mailto:alexxy at gentoo.org
mailto:alexxy at omrb.pnpi.spb.ru