[gmx-users] hanging regression test with MPI-enabled GROMACS v4.6.1 or v4.6.5 (GCC/OpenMPI/OpenBLAS/FFTW)
Kenneth Hoste
kenneth.hoste at ugent.be
Mon Dec 23 00:04:10 CET 2013
A small update on this:
* I also ran into this issue with MPICH2, but things were fine and dandy with MVAPICH2 (no idea why)
* the issue was resolved by setting the environment variable $OMP_NUM_THREADS to 1, which suggests this is a thread-safety issue (not sure where exactly, though); see the sketch below
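
A minimal sketch of the workaround (whether the variable is exported globally or just for the test run shouldn't matter):

$ export OMP_NUM_THREADS=1    # one OpenMP thread per MPI rank
$ make check
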
K.
On 20 Dec 2013, at 22:37, Kenneth Hoste wrote:
> Hello,
>
> I'm having trouble with the GROMACS regression tests hanging when GROMACS is built with a GCC/OpenMPI/OpenBLAS/FFTW toolchain and MPI support is enabled (both with and without OpenMP support).
> The tests run fine with the exact same build procedure when MPI support is left disabled (an OpenMP-only build works fine).
>
> When I'm using Intel compilers + Intel MPI + Intel MKL, all is well (both the MPI and the hybrid build work fine), but I'd like to get a working build with GCC+OpenMPI as well.
>
> I'm seeing these problems with both GROMACS v4.6.1 and v4.6.5, on a 64-bit Linux (Scientific Linux 6), Intel Sandy Bridge system.
>
> In particular, I'm seeing this issue with the following combinations:
>
> * GCC 4.6.4, OpenMPI 1.6.4, OpenBLAS 0.2.6, FFTW 3.3.3 (+ CUDA 5.0.35)
> * GCC 4.7.2, OpenMPI 1.6.4, OpenBLAS 0.2.6, FFTW 3.3.3
> * GCC 4.8.2, OpenMPI 1.7.3, OpenBLAS 0.2.8, FFTW 3.3.3 (+ CUDA 5.5.22)
>
> GROMACS is being built with the following commands (for a hybrid build, with both MPI and OpenMP enabled):
>
> cmake . \
>   -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/tmp \
>   -DCMAKE_C_COMPILER='mpicc' -DCMAKE_CXX_COMPILER='mpicxx' -DCMAKE_Fortran_COMPILER='mpif90' \
>   -DCMAKE_C_FLAGS='-fopenmp -O2 -march=native' \
>   -DCMAKE_CXX_FLAGS='-fopenmp -O2 -march=native' \
>   -DCMAKE_Fortran_FLAGS='-fopenmp -O2 -march=native' \
>   -DGMX_OPENMP=ON -DGMX_MPI=ON -DGMX_THREAD_MPI=OFF \
>   -DGMX_GPU=OFF -DGMX_X11=OFF \
>   -DGMX_PREFER_STATIC_LIBS=ON \
>   -DGMX_EXTERNAL_BLAS=ON -DGMX_EXTERNAL_LAPACK=ON \
>   -DGMX_BLAS_USER="-L/path/to/OpenBLAS/0.2.6-gompi-1.4.10-LAPACK-3.4.2/lib -lopenblas -lgfortran" \
>   -DGMX_LAPACK_USER="-L/path/to/OpenBLAS/0.2.6-gompi-1.4.10-LAPACK-3.4.2/lib -lopenblas -lgfortran" \
>   -DREGRESSIONTEST_PATH='/tmp/regressiontests-4.6.5'
> make -j 16
> make check
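> 
> (For reference, "make check" just drives ctest, which in turn runs the regression test driver once per test group; for the first group that boils down to roughly the following, as can also be seen in the process tree below:
> 
> $ cd /tmp/regressiontests-4.6.5
> $ perl gmxtest.pl simple -np 8 -suffix _mpi -crosscompile -noverbose -nosuffix
> )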
>
> The regression test hangs at the very first test, simple/angles1 (no additional output appears in mdrun.out or md.log):
>
> $ ps axf
>
> 106963 ? S 0:00 \_ make check
> 107009 ? S 0:00 \_ make -f CMakeFiles/Makefile2 check
> 107012 ? S 0:00 \_ make -f CMakeFiles/Makefile2 CMakeFiles/check.dir/all
> 107150 ? S 0:00 \_ make -f CMakeFiles/check.dir/build.make CMakeFiles/check.dir/build
> 107151 ? S 0:00 \_ /path/to/CMake/2.8.12-goolfc-2.6.10/bin/ctest --output-on-failure
> 107152 ? S 0:00 \_ /usr/bin/perl /tmp/regressiontests-4.6.5/gmxtest.pl simple -np 8 -suffix _mpi -crosscompile -noverbose -nosuffix
> 107187 ? S 0:00 \_ sh -c mpirun -np 8 -wdir /tmp//regressiontests-4.6.5/simple/angles1 mdrun_mpi -notunepme -table ../table -tablep ../tablep >mdrun.out 2>&1
> 107188 ? S 0:00 \_ mpirun -np 8 -wdir /tmp//regressiontests-4.6.5/simple/angles1 mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107189 ? RLl 5:49 \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107190 ? RLl 5:49 \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107191 ? RLl 5:49 \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107192 ? RLl 5:49 \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107193 ? RLl 5:49 \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107194 ? RLl 5:49 \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107195 ? RLl 5:49 \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107196 ? RLl 5:49 \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
>
> $ tail mdrun.out
>
> Non-default thread affinity set probably by the OpenMP library,
> disabling internal thread affinity
>
> $ tail -5 md.log
> The maximum allowed distance for charge groups involved in interactions is:
> non-bonded interactions 0.800 nm
> two-body bonded interactions (-rdd) 0.800 nm
> multi-body bonded interactions (-rdd) 0.493 nm
>
>
> strace reveals that the different MPI processes are stuck polling (each other?):
>
> $ strace -s 128 -x -p 107189 2>&1 | head -5
> Process 107189 attached - interrupt to quit
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
>
> $ strace -s 128 -x -p 107193 2>&1 | head -5
> Process 107193 attached - interrupt to quit
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
>
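> 
> (A gdb backtrace of one of those spinning ranks might pinpoint where they are stuck, e.g. something along these lines, with the PID taken from the ps listing above:
> 
> $ gdb -p 107189 -batch -ex 'thread apply all bt'
> )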
>
> Has anyone run into problems like this?
>
> Pointers to get to the bottom of this are very much welcome...
>
>
> regards,
>
> Kenneth