[gmx-users] Gromacs 5.1 and 5.1.1 crash in REMD

Mark Abraham mark.j.abraham at gmail.com
Tue Nov 17 21:01:02 CET 2015


Hi,

That is indeed strange. MPI_Allreduce isn't used in replica exchange, nor
did the replica-exchange code change between 5.0.6 and 5.1, so the problem
is elsewhere. You could try running with the environment variable
GMX_CYCLE_BARRIER set to 1 (which might require telling mpirun to export it
to all the ranks) so that we can localize which MPI_Allreduce is losing a
process. Alternatively, use any other way you have available to get a stack
trace from each process.
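
For example (these are generic launcher commands, not your actual job
script; the binary name, rank count, replica count and file names are only
placeholders):

    # Open MPI: export the variable to every rank with -x
    mpirun -np 64 -x GMX_CYCLE_BARRIER=1 gmx_mpi mdrun -multi 16 -replex 500 -deffnm remd

    # MPICH / Intel MPI: pass it with -genv instead
    mpiexec -np 64 -genv GMX_CYCLE_BARRIER 1 gmx_mpi mdrun -multi 16 -replex 500 -deffnm remd

    # For a stack trace, attach gdb to a hung rank on its node:
    gdb -p <pid-of-gmx_mpi-rank> -batch -ex "thread apply all bt"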

Mark

On Tue, Nov 17, 2015 at 6:11 PM Krzysztof Kuczera <kkuczera at ku.edu> wrote:

> Hi
> I am trying to run a temperature-exchange REMD simulation with GROMACS
> 5.1 or 5.1.1, and my job is crashing in a way that is difficult to explain:
> - the MD part works fine
> - the crash occurs at the first replica-exchange attempt
> - the error log contains a bunch of messages of the type shown below,
>   which I suppose mean that the MPI communication did not work
>
> NOTE: Turning on dynamic load balancing
>
> Fatal error in MPI_Allreduce: A process has failed, error stack:
> MPI_Allreduce(1421).......: MPI_Allreduce(sbuf=0x7fff5538018c,
> rbuf=0x28b2070, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000002) failed
> MPIR_Allreduce_impl(1262).:
> MPIR_Allreduce_intra(497).:
> MPIR_Bcast_binomial(245)..:
> dequeue_and_set_error(917): Communication error with rank 48
>
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(1421)......: MPI_Allreduce(sbuf=0x7fff31eb660c,
> rbuf=0x2852c00, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000001) failed
> MPIR_Allreduce_impl(1262):
> MPIR_Allreduce_intra(497):
> MPIR_Bcast_binomial(316).: Failure during collective
>
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(1421)......: MPI_Allreduce(sbuf=0x7fff2e54068c,
> rbuf=0x31e35a0, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000001) failed
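>
> (For reference, the replicas are launched in the standard way with mdrun
> -multi (or -multidir) and -replex under MPI; the command below is only a
> schematic of that kind of launch, with placeholder rank/replica counts and
> file names, not the exact job script.)
>
>     mpirun -np 64 gmx_mpi mdrun -multi 16 -replex 500 -deffnm remd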
>
>
> Recently compiled, slightly older versions such as 5.0.6 do not show this
> behavior. I have tried updating to the latest CMake, compiler and MPI
> versions on our system, but that does not change things.
> Does anyone have suggestions on how to fix this?
>
> Thanks
> Krzysztof
>
> --
> Krzysztof Kuczera
> Departments of Chemistry and Molecular Biosciences
> The University of Kansas
> 1251 Wescoe Hall Drive, 5090 Malott Hall
> Lawrence, KS 66045
> Tel: 785-864-5060 Fax: 785-864-5396 email: kkuczera at ku.edu
> http://oolung.chem.ku.edu/~kuczera/home.html
>

