[gmx-users] hanging regression test with MPI-enabled GROMACS v4.6.1 or v4.6.5 (GCC/OpenMPI/OpenBLAS/FFTW)

Kenneth Hoste kenneth.hoste at ugent.be
Mon Dec 23 00:04:10 CET 2013


A small update on this:

* I also ran into this issue with MPICH2, but everything worked with MVAPICH2; I have no idea why.
* The issue was resolved by setting the environment variable $OMP_NUM_THREADS to 1, which suggests a thread-safety issue (not sure where, though).
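
The workaround in the second point can be sketched as a small wrapper. This is only an illustration: `run_single_threaded` is a hypothetical helper name, and it assumes the MPI ranks inherit the caller's environment, which holds for ranks started on the local node (for remote nodes, Open MPI's `mpirun -x OMP_NUM_THREADS` would be needed instead).

```shell
# Hypothetical helper: run any command with OpenMP limited to one thread.
# env(1) sets the variable only for the launched command and its children.
run_single_threaded() {
    env OMP_NUM_THREADS=1 "$@"
}

# Example: print the value a child process would see.
run_single_threaded sh -c 'echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"'
```

From the build directory, the regression tests would then be started as `run_single_threaded make check`.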


K.

On 20 Dec 2013, at 22:37, Kenneth Hoste wrote:

> Hello,
> 
> I'm having trouble with the GROMACS regression tests hanging when GROMACS is built with a GCC/OpenMPI/OpenBLAS/FFTW toolchain and MPI support is enabled (both with and without OpenMP support).
> The tests work fine with the exact same build procedure when MPI support is not enabled (an OpenMP-only build works fine).
> 
> When I'm using Intel compilers + Intel MPI + Intel MKL, all is well (MPI or hybrid build works fine), but I'd like to be able to get a working build with GCC+OpenMPI as well.
> 
> I'm seeing these problems with both GROMACS v4.6.1 and v4.6.5, on a 64-bit Linux (Scientific Linux 6) system with Intel Sandy Bridge CPUs.
> 
> In particular, I'm seeing this issue with the following combinations:
> 
> 	* GCC 4.6.4, OpenMPI 1.6.4, OpenBLAS 0.2.6, FFTW 3.3.3 (+ CUDA 5.0.35)
> 	* GCC 4.7.2, OpenMPI 1.6.4, OpenBLAS 0.2.6, FFTW 3.3.3
> 	* GCC 4.8.2, OpenMPI 1.7.3, OpenBLAS 0.2.8, FFTW 3.3.3 (+ CUDA 5.5.22)
> 
> GROMACS is being built with the following commands (for a hybrid build with both MPI and OpenMP enabled):
> 
> 	cmake . \
> 	  -DCMAKE_INSTALL_PREFIX=/tmp \
> 	  -DCMAKE_C_COMPILER='mpicc' \
> 	  -DCMAKE_Fortran_FLAGS='-fopenmp -O2 -march=native' \
> 	  -DCMAKE_CXX_FLAGS='-fopenmp -O2 -march=native' \
> 	  -DCMAKE_CXX_COMPILER='mpicxx' \
> 	  -DCMAKE_Fortran_COMPILER='mpif90' \
> 	  -DCMAKE_C_FLAGS='-fopenmp -O2 -march=native' \
> 	  -DCMAKE_BUILD_TYPE=Debug \
> 	  -DGMX_PREFER_STATIC_LIBS=ON \
> 	  -DGMX_EXTERNAL_BLAS=ON \
> 	  -DGMX_EXTERNAL_LAPACK=ON \
> 	  -DGMX_X11=OFF \
> 	  -DGMX_OPENMP=ON \
> 	  -DGMX_MPI=ON \
> 	  -DGMX_THREAD_MPI=OFF \
> 	  -DGMX_GPU=OFF \
> 	  -DGMX_BLAS_USER="-L/path/to/OpenBLAS/0.2.6-gompi-1.4.10-LAPACK-3.4.2/lib -lopenblas -lgfortran" \
> 	  -DGMX_LAPACK_USER="-L/path/to/OpenBLAS/0.2.6-gompi-1.4.10-LAPACK-3.4.2/lib -lopenblas -lgfortran" \
> 	  -DREGRESSIONTEST_PATH='/tmp/regressiontests-4.6.5'
> 	make -j 16
> 	make check
> 
> The regression test is hanging at the very first test simple/angles1 (no additional output in mdrun.out or):
> 
> $ ps axf
> 
> 106963 ?        S      0:00                  \_ make check
> 107009 ?        S      0:00                      \_ make -f CMakeFiles/Makefile2 check
> 107012 ?        S      0:00                          \_ make -f CMakeFiles/Makefile2 CMakeFiles/check.dir/all
> 107150 ?        S      0:00                              \_ make -f CMakeFiles/check.dir/build.make CMakeFiles/check.dir/build
> 107151 ?        S      0:00                                  \_ /path/to/CMake/2.8.12-goolfc-2.6.10/bin/ctest --output-on-failure
> 107152 ?        S      0:00                                      \_ /usr/bin/perl /tmp/regressiontests-4.6.5/gmxtest.pl simple -np 8 -suffix _mpi -crosscompile -noverbose -nosuffix
> 107187 ?        S      0:00                                          \_ sh -c mpirun -np 8 -wdir /tmp//regressiontests-4.6.5/simple/angles1 mdrun_mpi    -notunepme -table ../table -tablep ../tablep >mdrun.out 2>&1
> 107188 ?        S      0:00                                              \_ mpirun -np 8 -wdir /tmp//regressiontests-4.6.5/simple/angles1 mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107189 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107190 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107191 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107192 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107193 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107194 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107195 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 107196 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> 
> $ tail mdrun.out
> 
> Non-default thread affinity set probably by the OpenMP library,
> disabling internal thread affinity 
> 
> $ tail -5 md.log
> The maximum allowed distance for charge groups involved in interactions is:
>                 non-bonded interactions           0.800 nm
>            two-body bonded interactions  (-rdd)   0.800 nm
>          multi-body bonded interactions  (-rdd)   0.493 nm
> 
> 
> strace reveals that the different MPI processes are stuck polling (each other?):
> 
> $ strace -s 128 -x -p 107189 2>&1 | head -5
> Process 107189 attached - interrupt to quit
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
> 
> $ strace -s 128 -x -p 107193 2>&1 | head -5
> Process 107193 attached - interrupt to quit
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
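
A note on reading these traces: the last argument to poll() is the timeout, and a value of 0 makes the call return immediately, so an endless stream of `= 0 (Timeout)` lines is consistent with a busy-wait loop (and with the `R` state and accumulating CPU time in the `ps` output). A minimal sketch for checking whether a captured trace contains anything other than poll() follows; `summarise_syscalls` is a hypothetical helper name, and the sample input lines are made up, modelled on the trace above.

```shell
# Hypothetical helper: read strace output on stdin and print
# "<count> <syscall>" for each syscall name, most frequent first.
summarise_syscalls() {
    awk -F'(' '/^[a-z_0-9]+\(/ { n[$1]++ }
               END { for (s in n) print n[s], s }' | sort -rn
}

# Example with made-up sample lines modelled on the trace above;
# the "Process ... attached" banner is ignored because it is not a syscall.
printf '%s\n' \
    'Process 107189 attached - interrupt to quit' \
    'poll([{fd=5, events=POLLIN}], 1, 0) = 0 (Timeout)' \
    'poll([{fd=5, events=POLLIN}], 1, 0) = 0 (Timeout)' |
    summarise_syscalls
```

On a live process the input could come from something like `strace -p PID 2>&1 | head -1000 | summarise_syscalls`, since strace writes its trace to stderr.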
> 
> 
> Has anyone run into problems like this?
> 
> Pointers to get to the bottom of this are very much welcome...
> 
> 
> regards,
> 
> Kenneth