[gmx-users] hanging regression test with MPI-enabled GROMACS v4.6.1 or v4.6.5 (GCC/OpenMPI/OpenBLAS/FFTW)

Kenneth Hoste kenneth.hoste at ugent.be
Fri Dec 20 22:37:36 CET 2013


Hello,

I'm having trouble with the GROMACS regression tests hanging when GROMACS is built with a GCC/OpenMPI/OpenBLAS/FFTW toolchain and MPI support is enabled (both with and without OpenMP support).
The tests work fine when I use the exact same build procedure without enabling MPI support (i.e. an OpenMP-only build works fine).

When I'm using Intel compilers + Intel MPI + Intel MKL, all is well (MPI or hybrid build works fine), but I'd like to be able to get a working build with GCC+OpenMPI as well.

I'm seeing these problems with both GROMACS v4.6.1 and v4.6.5, on a Linux 64-bit (Scientific Linux 6), Intel Sandy Bridge system.

In particular, I'm seeing this issue with the following combinations:

	* GCC 4.6.4, OpenMPI 1.6.4, OpenBLAS 0.2.6, FFTW 3.3.3 (+ CUDA 5.0.35)
	* GCC 4.7.2, OpenMPI 1.6.4, OpenBLAS 0.2.6, FFTW 3.3.3
	* GCC 4.8.2, OpenMPI 1.7.3, OpenBLAS 0.2.8, FFTW 3.3.3 (+ CUDA 5.5.22)

GROMACS is being built with the following commands (for a hybrid build with both MPI and OpenMP enabled):

	cmake . \
	  -DCMAKE_INSTALL_PREFIX=/tmp \
	  -DCMAKE_C_COMPILER='mpicc' \
	  -DCMAKE_Fortran_FLAGS='-fopenmp -O2 -march=native' \
	  -DCMAKE_CXX_FLAGS='-fopenmp -O2 -march=native' \
	  -DCMAKE_CXX_COMPILER='mpicxx' \
	  -DCMAKE_Fortran_COMPILER='mpif90' \
	  -DCMAKE_C_FLAGS='-fopenmp -O2 -march=native' \
	  -DCMAKE_BUILD_TYPE=Debug \
	  -DGMX_PREFER_STATIC_LIBS=ON \
	  -DGMX_EXTERNAL_BLAS=ON \
	  -DGMX_EXTERNAL_LAPACK=ON \
	  -DGMX_X11=OFF \
	  -DGMX_OPENMP=ON \
	  -DGMX_MPI=ON \
	  -DGMX_THREAD_MPI=OFF \
	  -DGMX_GPU=OFF \
	  -DGMX_BLAS_USER="-L/path/to/OpenBLAS/0.2.6-gompi-1.4.10-LAPACK-3.4.2/lib -lopenblas -lgfortran" \
	  -DGMX_LAPACK_USER="-L/path/to/OpenBLAS/0.2.6-gompi-1.4.10-LAPACK-3.4.2/lib -lopenblas -lgfortran" \
	  -DREGRESSIONTEST_PATH='/tmp/regressiontests-4.6.5'
	make -j 16
	make check

The regression test run hangs at the very first test, simple/angles1 (no additional output appears in mdrun.out or md.log):

$ ps axf

106963 ?        S      0:00                  \_ make check
107009 ?        S      0:00                      \_ make -f CMakeFiles/Makefile2 check
107012 ?        S      0:00                          \_ make -f CMakeFiles/Makefile2 CMakeFiles/check.dir/all
107150 ?        S      0:00                              \_ make -f CMakeFiles/check.dir/build.make CMakeFiles/check.dir/build
107151 ?        S      0:00                                  \_ /path/to/CMake/2.8.12-goolfc-2.6.10/bin/ctest --output-on-failure
107152 ?        S      0:00                                      \_ /usr/bin/perl /tmp/regressiontests-4.6.5/gmxtest.pl simple -np 8 -suffix _mpi -crosscompile -noverbose -nosuffix
107187 ?        S      0:00                                          \_ sh -c mpirun -np 8 -wdir /tmp//regressiontests-4.6.5/simple/angles1 mdrun_mpi    -notunepme -table ../table -tablep ../tablep >mdrun.out 2>&1
107188 ?        S      0:00                                              \_ mpirun -np 8 -wdir /tmp//regressiontests-4.6.5/simple/angles1 mdrun_mpi -notunepme -table ../table -tablep ../tablep
107189 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
107190 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
107191 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
107192 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
107193 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
107194 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
107195 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
107196 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep

$ tail mdrun.out

Non-default thread affinity set probably by the OpenMP library,
disabling internal thread affinity 

$ tail -5 md.log
The maximum allowed distance for charge groups involved in interactions is:
                 non-bonded interactions           0.800 nm
            two-body bonded interactions  (-rdd)   0.800 nm
          multi-body bonded interactions  (-rdd)   0.493 nm
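
For reference, the hanging test can presumably also be rerun by hand with a time limit, so that a hang gets reported instead of blocking forever. Below is a minimal sketch of that idea; the helper name run_with_limit and the 300-second limit are my own choices for illustration, and it assumes the `timeout` command from GNU coreutils is available:

```shell
#!/bin/sh
# run_with_limit SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of GROMACS or the regression tests):
# runs COMMAND under GNU coreutils `timeout` and reports whether it
# finished by itself or had to be stopped after SECONDS.
run_with_limit() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        # GNU timeout exits with status 124 when the time limit was reached
        echo "HANG: no exit within ${limit}s: $*"
    else
        echo "DONE: exit status ${status}: $*"
    fi
    return "$status"
}

# Example, using the mdrun_mpi command line from the process listing above
# (the 300 s limit is arbitrary):
# cd /tmp/regressiontests-4.6.5/simple/angles1 &&
#   run_with_limit 300 mpirun -np 8 mdrun_mpi -notunepme -table ../table -tablep ../tablep
```

Running the same wrapper with a smaller -np value would be one way to check whether the hang depends on the number of MPI processes.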


strace reveals that the different MPI processes are stuck polling (each other?):

$ strace -s 128 -x -p 107189 2>&1 | head -5
Process 107189 attached - interrupt to quit
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)

$ strace -s 128 -x -p 107193 2>&1 | head -5
Process 107193 attached - interrupt to quit
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
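
All poll() calls above pass a timeout of 0 and return 0, i.e. they return immediately without any file descriptor being ready. Together with the R state and the 5:49 of CPU time in the ps output, this suggests the processes are busy-waiting rather than sleeping. To check this pattern across all eight processes without reading the raw output, a small filter along these lines could be used (a sketch only; the function name count_idle_polls is made up for illustration, and the pattern assumes the strace output format shown above):

```shell
#!/bin/sh
# count_idle_polls: hypothetical helper that reads strace output on stdin
# and prints the number of poll() calls that used a timeout of 0 ms and
# returned 0, i.e. lines of the form
#   poll([...], <nfds>, 0) = 0 (Timeout)
count_idle_polls() {
    grep -c -E '^poll\(.*, [0-9]+, 0\) += +0 \(Timeout\)'
}

# Example: sample one process for 5 seconds (assumes GNU coreutils `timeout`):
# timeout 5 strace -p 107189 2>&1 | count_idle_polls
```

A large count within a few seconds for every process would be consistent with all of them spinning in a polling loop while waiting for a message that never arrives.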


Has anyone run into problems like this?

Pointers to get to the bottom of this are very much welcome...


regards,

Kenneth

