[gmx-developers] RE: Gromacs on 48 core magny-cours AMDs
Igor Leontyev
ileontyev at ucdavis.edu
Sat Sep 17 21:04:41 CEST 2011
The problem with unstable Gromacs performance still exists, but there is some
progress:
1) The MPI version is unstable in runs using all 48 cores per node, but STABLE
when fewer than 48 cores/node are used.
2) The MPI version runs well even on 180 cores, distributed as 45 per node
across 4 nodes.
3) The threaded version has no problems in 48-core runs.
- the cluster configuration is typical (not NUMA);
- software: Gromacs 4.5.4; compiled with gcc 4.4.6; CentOS 5.6 kernel
2.6.18-238.19.1.el5.
- the compilation used the default math libraries and OpenMPI 1.4.3 with
InfiniBand support
Any idea why using all 48 cores/node results in unstable performance?
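One possibility raised later in this thread is that the MPI ranks are not
pinned to cores. Below is a minimal sketch of forcing and checking the binding
with OpenMPI 1.4.x; the --bind-to-core, --bycore and --report-bindings mpirun
options are assumptions about what this particular build supports:

# Sketch only: start one rank per core, bind them, and print the bindings.
mpirun -np 48 --hostfile node_loading.txt \
       --bind-to-core --bycore --report-bindings \
       mdrun_mpich1.4.3 -s cco_PM_ff03_sorin_scaled_meanpol.tpr -deffnm bench48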
Igor Leontyev
> Igor wrote:
> The issue might be related to the configuration of our brand-new cluster,
> which I am testing now. On this cluster the unstable behavior of Gromacs is
> also observed on Intel Xeon nodes. For the Gromacs installation I repeated
> all the steps that I have previously done many times on an 8-core dual-Xeon
> workstation without any problems. See below the compilation script.
>
> # =====================================================================
> #
> # path where to install
> pth_install=/home/leontyev/programs/bin/gromacs/gromacs-4.5.4
> # program name suffix
> suff="_mpich1.4.3"
> # path of FFTW library
> # SINGLE PRECISION
> pth_fft=/home/leontyev/programs/bin/fftw/fftw-3.2.2/single
> # path of 'open_mpi' library
> pth_lam=/home/leontyev/programs/bin/mpi/openmpi/openmpi-1.4.3
> export LD_LIBRARY_PATH="$pth_lam/lib"
>
> PATH="$pth_lam/bin:$PATH"
>
> export CPPFLAGS="-I/$pth_fft/include -I/$pth_lam/include"
> export LDFLAGS="-L/$pth_fft/lib -L/$pth_lam/lib"
>
> make distclean
> # SINGLE PRECISION
> ./configure --without-x --prefix=/$pth_install --program-suffix=$suff \
>             --enable-mpi
>
> make -j 12 mdrun >& install.log
> make install-mdrun >> install.log
> # =====================================================================
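For comparison with point 3 above, here is a minimal sketch of the
corresponding threaded (non-MPI) build of the same tree; it assumes that in
GROMACS 4.5.x omitting --enable-mpi gives the default thread-MPI mdrun, and
the _threads suffix is just an illustrative name:

# Sketch only: same configure call without MPI; thread support is assumed
# to be on by default in 4.5.x when --enable-mpi is omitted.
./configure --without-x --prefix=/$pth_install --program-suffix=_threads
make -j 12 mdrun >& install_threads.log
make install-mdrun >> install_threads.log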
>
> Igor
>
>
>> Alexey Shvetsov wrote:
>>
>> Hello!
>>
>> Well, there may be several problems:
>> 1. An old kernel that works incorrectly with large NUMA systems
>> 2. Processes not being bound to cores correctly
>> 3. The configuration of gcc/math libs
>>
>> What are your MPI version and the versions of the FFTW and BLAS libs, if
>> you use external ones? Also, please post your CFLAGS.
>>
>> Here we have good performance on such nodes running SLES with a 2.6.32
>> kernel (with gentoo-prefix on top of it, with OpenMPI and the OFED stack)
>> and with Gentoo (kernel 3.0.4) with many system optimizations made by me
>> =)
>>
>> All results are stable. Gentoo works better here because it doesn't have
>> the IRQ bug in the kernel, plus some optimizations.
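A sketch of commands that would collect the requested information; ompi_info
ships with OpenMPI, $pth_fft is the FFTW path from the script above, and the
config.log grep assumes the GROMACS build tree is still available:

# Sketch only: gather kernel, compiler, MPI, FFTW and compile-flag info.
uname -r                          # kernel version
gcc --version | head -n 1         # compiler version
ompi_info | head -n 5             # OpenMPI version and build details
ls $pth_fft/lib                   # FFTW libraries actually linked against
grep -m 1 'CFLAGS' config.log     # flags picked up at configure time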
> On Sep 1, 2011, at 9:19 AM, Sander Pronk wrote:
>
>>
>> On 31 Aug 2011, at 22:10, Igor Leontyev wrote:
>>
>>> Hi
>>> I am benchmarking a 100K atom system (protein ~12K and solvent ~90K
>>> atoms, 1 fs time step, cutoffs 1.2 nm) on a 48-core 2.1 GHz AMD node.
>>> Software: Gromacs 4.5.4; compiled with gcc 4.4.6; CentOS 5.6 kernel
>>> 2.6.18-238.19.1.el5. See the results of g_tune_pme below. The
>>> performance is absolutely unstable; the computation time for equivalent
>>> runs can differ by orders of magnitude.
>>>
>>> The issue seems to be similar to what has been discussed earlier
>>> http://lists.gromacs.org/pipermail/gmx-users/2010-October/055113.html
>>> Is there any progress in resolving it?
>>
>> That's an old kernel. If I remember correctly, that thread discussed
>> issues related to thread and process affinity and NUMA-awareness on older
>> kernels.
>>
>> Perhaps you could try a newer kernel?
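A quick way to see what the kernel actually exposes as the NUMA layout on one
of these nodes (numactl is assumed to be installed):

# Sketch only: report kernel version and the NUMA node/core/memory layout.
uname -r
numactl --hardware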
>
> Hi,
>
> we are running a slightly older kernel and get nice performance on our
> 48-core Magny-Cours.
> Maybe with mpich the processes are not being pinned to the cores correctly.
>
> Could you try the threaded version of mdrun? This is what gives the best
> (and most reliable) performance in our case.
>
> Carsten
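A minimal sketch of such a threaded run on a single node; mdrun_threads is a
hypothetical binary name for a thread-MPI 4.5.x build, and -nt sets the number
of threads:

# Sketch only: run the thread-MPI mdrun on all 48 cores of one node.
mdrun_threads -nt 48 -s cco_PM_ff03_sorin_scaled_meanpol.tpr -deffnm bench48_threads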
>
>
>>
>>
>>>
>>> Igor
>>>
>>>
>>> ------------------------------------------------------------
>>>
>>> P E R F O R M A N C E R E S U L T S
>>>
>>> ------------------------------------------------------------
>>> g_tune_pme for Gromacs VERSION 4.5.4
>>> Number of nodes : 48
>>> The mpirun command is :
>>> /home/leontyev/programs/bin/mpi/openmpi/openmpi-1.4.3/bin/mpirun --hostfile
>>> node_loading.txt
>>> Passing # of nodes via : -np
>>> The mdrun command is :
>>> /home/leontyev/programs/bin/gromacs/gromacs-4.5.4/bin/mdrun_mpich1.4.3
>>> mdrun args benchmarks : -resetstep 100 -o bench.trr -x bench.xtc -cpo
>>> bench.cpt -c bench.gro -e bench.edr -g bench.log
>>> Benchmark steps : 1000
>>> dlb equilibration steps : 100
>>> Repeats for each test : 10
>>> Input file : cco_PM_ff03_sorin_scaled_meanpol.tpr
>>> Coulomb type : PME
>>> Grid spacing x y z : 0.114376 0.116700 0.116215
>>> Van der Waals type : Cut-off
>>>
>>> Will try these real/reciprocal workload settings:
>>> No. scaling rcoulomb nkx nky nkz spacing rvdw tpr file
>>> 0 -input- 1.200000 72 80 112 0.116700 1.200000
>>> cco_PM_ff03_sorin_scaled_meanpol_bench00.tpr
>>>
>>> Individual timings for input file 0
>>> (cco_PM_ff03_sorin_scaled_meanpol_bench00.tpr):
>>> PME nodes Gcycles ns/day PME/f Remark
>>> 24 3185.840 2.734 0.538 OK.
>>> 24 7237.416 1.203 1.119 OK.
>>> 24 3225.448 2.700 0.546 OK.
>>> 24 5844.942 1.489 1.012 OK.
>>> 24 4013.986 2.169 0.552 OK.
>>> 24 18578.174 0.469 0.842 OK.
>>> 24 3234.702 2.692 0.559 OK.
>>> 24 25818.267 0.337 0.815 OK.
>>> 24 32470.278 0.268 0.479 OK.
>>> 24 3234.806 2.692 0.561 OK.
>>> 23 15097.577 0.577 0.824 OK.
>>> 23 2948.211 2.954 0.705 OK.
>>> 23 15640.485 0.557 0.826 OK.
>>> 23 66961.240 0.130 3.215 OK.
>>> 23 2964.927 2.938 0.698 OK.
>>> 23 2965.896 2.937 0.669 OK.
>>> 23 11205.121 0.774 0.668 OK.
>>> 23 2964.737 2.938 0.672 OK.
>>> 23 13384.753 0.649 0.665 OK.
>>> 23 3738.425 2.329 0.738 OK.
>>> 22 3130.744 2.782 0.682 OK.
>>> 22 3981.770 2.187 0.659 OK.
>>> 22 6397.259 1.350 0.666 OK.
>>> 22 41374.579 0.211 3.509 OK.
>>> 22 3193.327 2.728 0.683 OK.
>>> 22 21405.007 0.407 0.871 OK.
>>> 22 3543.511 2.457 0.686 OK.
>>> 22 3539.981 2.460 0.701 OK.
>>> 22 30946.123 0.281 1.235 OK.
>>> 22 18031.023 0.483 0.729 OK.
>>> 21 2978.520 2.924 0.699 OK.
>>> 21 4487.921 1.940 0.666 OK.
>>> 21 39796.932 0.219 1.085 OK.
>>> 21 3027.659 2.877 0.714 OK.
>>> 21 58613.050 0.149 1.089 OK.
>>> 21 2973.281 2.929 0.698 OK.
>>> 21 34991.505 0.249 0.702 OK.
>>> 21 4479.034 1.944 0.696 OK.
>>> 21 40401.894 0.216 1.310 OK.
>>> 21 63325.943 0.138 1.124 OK.
>>> 20 17100.304 0.510 0.620 OK.
>>> 20 2859.158 3.047 0.832 OK.
>>> 20 2660.459 3.274 0.820 OK.
>>> 20 2871.060 3.034 0.821 OK.
>>> 20 105947.063 0.082 0.728 OK.
>>> 20 2851.650 3.055 0.827 OK.
>>> 20 2766.737 3.149 0.837 OK.
>>> 20 13887.535 0.627 0.813 OK.
>>> 20 9450.158 0.919 0.854 OK.
>>> 20 2983.460 2.920 0.838 OK.
>>> 19 0.000 0.000 - No DD grid found for these settings.
>>> 18 62490.241 0.139 1.070 OK.
>>> 18 75625.947 0.115 0.512 OK.
>>> 18 3584.509 2.430 1.176 OK.
>>> 18 4988.745 1.734 1.197 OK.
>>> 18 92981.804 0.094 0.529 OK.
>>> 18 3070.496 2.837 1.192 OK.
>>> 18 3089.339 2.820 1.204 OK.
>>> 18 5880.675 1.465 1.170 OK.
>>> 18 3094.133 2.816 1.214 OK.
>>> 18 3573.552 2.437 1.191 OK.
>>> 17 0.000 0.000 - No DD grid found for these settings.
>>> 16 3105.597 2.805 0.998 OK.
>>> 16 2719.826 3.203 1.045 OK.
>>> 16 3124.013 2.788 0.992 OK.
>>> 16 2708.751 3.216 1.030 OK.
>>> 16 3116.887 2.795 1.023 OK.
>>> 16 2695.859 3.232 1.038 OK.
>>> 16 2710.272 3.215 1.033 OK.
>>> 16 32639.259 0.267 0.514 OK.
>>> 16 56748.577 0.153 0.959 OK.
>>> 16 32362.192 0.269 1.816 OK.
>>> 15 40410.983 0.216 1.241 OK.
>>> 15 3727.108 2.337 1.262 OK.
>>> 15 3297.944 2.642 1.242 OK.
>>> 15 23012.201 0.379 0.994 OK.
>>> 15 3328.307 2.618 1.248 OK.
>>> 15 56869.719 0.153 0.568 OK.
>>> 15 26662.044 0.327 0.854 OK.
>>> 15 44026.837 0.198 1.198 OK.
>>> 15 3754.812 2.320 1.238 OK.
>>> 15 68683.967 0.127 0.844 OK.
>>> 14 2934.532 2.969 1.466 OK.
>>> 14 2824.434 3.085 1.430 OK.
>>> 14 2778.103 3.137 1.391 OK.
>>> 14 28435.548 0.306 0.957 OK.
>>> 14 2876.113 3.030 1.396 OK.
>>> 14 2803.951 3.108 1.438 OK.
>>> 14 9538.366 0.913 1.400 OK.
>>> 14 2887.242 3.018 1.424 OK.
>>> 14 32542.115 0.268 0.529 OK.
>>> 14 14256.539 0.609 1.432 OK.
>>> 13 5010.011 1.732 1.768 OK.
>>> 13 19270.893 0.452 1.481 OK.
>>> 13 3451.426 2.525 1.860 OK.
>>> 13 28566.186 0.305 0.620 OK.
>>> 13 3481.006 2.504 1.833 OK.
>>> 13 28457.876 0.306 0.933 OK.
>>> 13 3689.128 2.362 1.795 OK.
>>> 13 3451.925 2.525 1.831 OK.
>>> 13 34918.063 0.249 1.838 OK.
>>> 13 3473.566 2.509 1.854 OK.
>>> 12 42705.256 0.204 1.039 OK.
>>> 12 4934.453 1.763 1.292 OK.
>>> 12 16759.163 0.520 1.288 OK.
>>> 12 27660.618 0.315 0.855 OK.
>>> 12 6293.874 1.380 1.263 OK.
>>> 12 40502.818 0.215 1.284 OK.
>>> 12 31595.114 0.276 0.615 OK.
>>> 12 61936.825 0.140 0.612 OK.
>>> 12 3013.850 2.891 1.345 OK.
>>> 12 3840.023 2.269 1.310 OK.
>>> 0 2628.156 3.317 - OK.
>>> 0 2573.649 3.387 - OK.
>>> 0 95523.769 0.091 - OK.
>>> 0 2594.895 3.360 - OK.
>>> 0 2614.131 3.335 - OK.
>>> 0 2610.647 3.339 - OK.
>>> 0 2560.067 3.405 - OK.
>>> 0 2609.485 3.341 - OK.
>>> 0 2603.154 3.349 - OK.
>>> 0 2583.289 3.375 - OK.
>>> -1( 16) 2672.797 3.260 1.002 OK.
>>> -1( 16) 57769.149 0.151 1.723 OK.
>>> -1( 16) 48598.334 0.179 1.138 OK.
>>> -1( 16) 2699.333 3.228 1.040 OK.
>>> -1( 16) 54243.321 0.161 1.679 OK.
>>> -1( 16) 2719.854 3.203 1.051 OK.
>>> -1( 16) 2716.365 3.207 1.051 OK.
>>> -1( 16) 24278.608 0.359 0.835 OK.
>>> -1( 16) 19357.359 0.449 1.006 OK.
>>> -1( 16) 45500.360 0.191 0.795 OK.
>>>
>>> Tuning took 500.5 minutes.
>>>
>>> ------------------------------------------------------------
>>> Summary of successful runs:
>>> Line  tpr  PME nodes  Gcycles Av.   Std.dev.  ns/day  PME/f  DD grid
>>>    0    0         24    10684.386  10896.612   1.675  0.702  3 4 2
>>>    1    0         23    13787.137  19462.982   1.678  0.968  1 5 5
>>>    2    0         22    13554.332  13814.153   1.535  1.042  2 13 1
>>>    3    0         21    25507.574  24601.033   1.358  0.878  3 3 3
>>>    4    0         20    16337.758  31934.533   2.062  0.799  2 2 7
>>>    5    0         18    25837.944  36067.176   1.689  1.045  3 2 5
>>>    6    0         16    14193.123  19370.807   2.194  1.045  4 4 2
>>>    7    0         15    27377.392  24308.700   1.132  1.069  3 11 1
>>>    8    0         14    10187.694  11414.829   2.044  1.286  1 2 17
>>>    9    0         13    13377.008  12969.168   1.547  1.581  1 5 7
>>>   10    0         12    23924.199  20299.796   0.997  1.090  3 4 3
>>>   11    0          0    11890.124  29385.874   3.030      -  6 4 2
>>>   12    0    -1( 16)    26055.548  23371.735   1.439  1.132  4 4 2
>>> --