[gmx-users] hanging regression test with MPI-enabled GROMACS v4.6.1 or v4.6.5 (GCC/OpenMPI/OpenBLAS/FFTW)

Mark Abraham mark.j.abraham at gmail.com
Mon Dec 23 04:06:03 CET 2013


Hi,

Thanks for the report. For diagnosis, I would suggest using the
GROMACS-internal BLAS and LAPACK, since the libraries you are supplying
look like they may have OpenMP support. Such support is unlikely to be
useful for GROMACS, and may be contributing to this problem. With the
internal libraries there is no need to configure Fortran at all (GROMACS
does not use it), and no reason to set the C++ flags by hand either.
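
Something along these lines might be worth trying (untested; the OpenBLAS
path is the one from your report, the library file name and the trimmed-down
reconfigure line are just my assumptions):

      # does the external BLAS pull in an OpenMP runtime (e.g. libgomp)?
      ldd /path/to/OpenBLAS/0.2.6-gompi-1.4.10-LAPACK-3.4.2/lib/libopenblas.so | grep -i gomp

      # reconfigure in a clean build tree against the internal BLAS/LAPACK,
      # with no Fortran compiler and no hand-set C++ flags
      cmake . -DCMAKE_INSTALL_PREFIX=/tmp \
        -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
        -DGMX_EXTERNAL_BLAS=OFF -DGMX_EXTERNAL_LAPACK=OFF \
        -DGMX_X11=OFF -DGMX_OPENMP=ON -DGMX_MPI=ON \
        -DGMX_THREAD_MPI=OFF -DGMX_GPU=OFF \
        -DREGRESSIONTEST_PATH='/tmp/regressiontests-4.6.5'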

Mark


On Mon, Dec 23, 2013 at 10:03 AM, Kenneth Hoste <kenneth.hoste at ugent.be> wrote:

> A small update on this:
>
> * I also ran into this issue with MPICH2, but things worked fine with
> MVAPICH2; I have no idea why
> * the issue was resolved by setting the environment variable
> $OMP_NUM_THREADS to 1, which suggests a thread-safety issue somewhere
> (though I'm not sure where); see the snippet below
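>
> A minimal sketch of that workaround, assuming a bash-like shell and the
> same "make check" as in the build steps quoted below:
>
>       export OMP_NUM_THREADS=1
>       make check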
>
>
> K.
>
> On 20 Dec 2013, at 22:37, Kenneth Hoste wrote:
>
> > Hello,
> >
> > I'm having trouble with the GROMACS regression tests hanging for a
> > GROMACS build done with a GCC/OpenMPI/OpenBLAS/FFTW toolchain whenever
> > MPI support is enabled (both with and without OpenMP support).
> > The tests work fine with the exact same build procedure minus MPI
> > support (an OpenMP-only build works fine).
> >
> > When I'm using Intel compilers + Intel MPI + Intel MKL, all is well (an
> > MPI or hybrid build works fine), but I'd like to get a working build
> > with GCC+OpenMPI as well.
> >
> > I'm seeing these problems with both GROMACS v4.6.1 and v4.6.5, on a
> > 64-bit Linux (Scientific Linux 6) Intel Sandy Bridge system.
> >
> > In particular, I'm seeing this issue with the following combinations:
> >
> >       * GCC 4.6.4, OpenMPI 1.6.4, OpenBLAS 0.2.6, FFTW 3.3.3 (+ CUDA 5.0.35)
> >       * GCC 4.7.2, OpenMPI 1.6.4, OpenBLAS 0.2.6, FFTW 3.3.3
> >       * GCC 4.8.2, OpenMPI 1.7.3, OpenBLAS 0.2.8, FFTW 3.3.3 (+ CUDA 5.5.22)
> >
> > GROMACS is being built with the following commands (for a hybrid build
> > with both MPI and OpenMP enabled):
> >
> >       cmake . -DCMAKE_INSTALL_PREFIX=/tmp \
> >         -DCMAKE_C_COMPILER='mpicc' -DCMAKE_CXX_COMPILER='mpicxx' \
> >         -DCMAKE_Fortran_COMPILER='mpif90' \
> >         -DCMAKE_C_FLAGS='-fopenmp -O2 -march=native' \
> >         -DCMAKE_CXX_FLAGS='-fopenmp -O2 -march=native' \
> >         -DCMAKE_Fortran_FLAGS='-fopenmp -O2 -march=native' \
> >         -DCMAKE_BUILD_TYPE=Debug \
> >         -DGMX_PREFER_STATIC_LIBS=ON \
> >         -DGMX_EXTERNAL_BLAS=ON -DGMX_EXTERNAL_LAPACK=ON \
> >         -DGMX_BLAS_USER="-L/path/to/OpenBLAS/0.2.6-gompi-1.4.10-LAPACK-3.4.2/lib -lopenblas -lgfortran" \
> >         -DGMX_LAPACK_USER="-L/path/to/OpenBLAS/0.2.6-gompi-1.4.10-LAPACK-3.4.2/lib -lopenblas -lgfortran" \
> >         -DGMX_X11=OFF -DGMX_OPENMP=ON -DGMX_MPI=ON \
> >         -DGMX_THREAD_MPI=OFF -DGMX_GPU=OFF \
> >         -DREGRESSIONTEST_PATH='/tmp/regressiontests-4.6.5'
> >       make -j 16
> >       make check
> >
> > The regression tests hang at the very first test, simple/angles1 (no
> > additional output appears in mdrun.out or md.log):
> >
> > $ ps axf
> >
> > 106963 ?        S      0:00                  \_ make check
> > 107009 ?        S      0:00                      \_ make -f CMakeFiles/Makefile2 check
> > 107012 ?        S      0:00                          \_ make -f CMakeFiles/Makefile2 CMakeFiles/check.dir/all
> > 107150 ?        S      0:00                              \_ make -f CMakeFiles/check.dir/build.make CMakeFiles/check.dir/build
> > 107151 ?        S      0:00                                  \_ /path/to/CMake/2.8.12-goolfc-2.6.10/bin/ctest --output-on-failure
> > 107152 ?        S      0:00                                      \_ /usr/bin/perl /tmp/regressiontests-4.6.5/gmxtest.pl simple -np 8 -suffix _mpi -crosscompile -noverbose -nosuffix
> > 107187 ?        S      0:00                                          \_ sh -c mpirun -np 8 -wdir /tmp//regressiontests-4.6.5/simple/angles1 mdrun_mpi    -notunepme -table ../table -tablep ../tablep >mdrun.out 2>&1
> > 107188 ?        S      0:00                                              \_ mpirun -np 8 -wdir /tmp//regressiontests-4.6.5/simple/angles1 mdrun_mpi -notunepme -table ../table -tablep ../tablep
> > 107189 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> > 107190 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> > 107191 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> > 107192 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> > 107193 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> > 107194 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> > 107195 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> > 107196 ?        RLl    5:49                                                  \_ mdrun_mpi -notunepme -table ../table -tablep ../tablep
> >
> > $ tail mdrun.out
> >
> > Non-default thread affinity set probably by the OpenMP library,
> > disabling internal thread affinity
> >
> > $ tail -5 md.log
> > The maximum allowed distance for charge groups involved in interactions is:
> >                 non-bonded interactions           0.800 nm
> >            two-body bonded interactions  (-rdd)   0.800 nm
> >          multi-body bonded interactions  (-rdd)   0.493 nm
> >
> >
> > strace reveals that the different MPI processes are stuck polling (each other?):
> >
> > $ strace -s 128 -x -p 107189 2>&1 | head -5
> > Process 107189 attached - interrupt to quit
> > poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
> > poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
> > poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
> > poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=24, events=POLLIN}], 6, 0) = 0 (Timeout)
> >
> > $ strace -s 128 -x -p 107193 2>&1 | head -5
> > Process 107193 attached - interrupt to quit
> > poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
> > poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
> > poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
> > poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
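> >
> > In case it helps with diagnosis: a backtrace of one of the hung ranks
> > could show where they are waiting. A possible sketch, assuming gdb is
> > installed, using one of the mdrun_mpi PIDs from the ps output above:
> >
> >       gdb -batch -p 107189 -ex 'thread apply all bt'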
> >
> >
> > Has anyone run into problems like this?
> >
> > Pointers to get to the bottom of this are very much welcome...
> >
> >
> > regards,
> >
> > Kenneth
>
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.
>

