[gmx-users] Gromacs 5.1 and 5.1.1 crash in REMD

Justin Lemkul jalemkul at vt.edu
Tue Nov 17 22:20:38 CET 2015



On 11/17/15 3:00 PM, Mark Abraham wrote:
> Hi,
>
> That is indeed strange. MPI_Allreduce isn't used in replica exchange, nor
> did the replica-exchange code change between 5.0.6 and 5.1, so the problem
> is elsewhere. You could try running with the environment variable
> GMX_CYCLE_BARRIER set to 1 (which might require telling mpirun to export
> it to all ranks) so that we can localize which MPI_Allreduce is losing a
> process. Or use any other means you have available to get a stack trace
> from each process.
>
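As a rough sketch of what that could look like in practice (the binary name and
rank count are illustrative, and the pass-through flag depends on the MPI stack:
-x is Open MPI's syntax, while MPICH and Intel MPI use -genv), with the usual
mdrun arguments appended:

    export GMX_CYCLE_BARRIER=1
    # Open MPI: explicitly export the variable to every rank, not just rank 0
    mpirun -np 64 -x GMX_CYCLE_BARRIER gmx_mpi mdrun ...
    # MPICH / Intel MPI equivalent: set the variable on the launch line
    mpiexec -n 64 -genv GMX_CYCLE_BARRIER 1 gmx_mpi mdrun ...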

Maybe related to this?

http://redmine.gromacs.org/issues/1848

-Justin

> Mark
>
> On Tue, Nov 17, 2015 at 6:11 PM Krzysztof Kuczera <kkuczera at ku.edu> wrote:
>
>> Hi
>> I am trying to run a temperature-exchange REMD simulation with GROMACS
>> 5.1 or 5.1.1, and my job is crashing in a way that is difficult to explain:
>> - the MD part works fine
>> - the crash occurs at the first replica-exchange attempt
>> - the error log contains a bunch of messages of the type shown below,
>>   which I suppose means that the MPI communication did not work
>>
>> NOTE: Turning on dynamic load balancing
>>
>> Fatal error in MPI_Allreduce: A process has failed, error stack:
>> MPI_Allreduce(1421).......: MPI_Allreduce(sbuf=0x7fff5538018c,
>> rbuf=0x28b2070, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000002) failed
>> MPIR_Allreduce_impl(1262).:
>> MPIR_Allreduce_intra(497).:
>> MPIR_Bcast_binomial(245)..:
>> dequeue_and_set_error(917): Communication error with rank 48
>>
>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
>> MPI_Allreduce(1421)......: MPI_Allreduce(sbuf=0x7fff31eb660c,
>> rbuf=0x2852c00, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000001) failed
>> MPIR_Allreduce_impl(1262):
>> MPIR_Allreduce_intra(497):
>> MPIR_Bcast_binomial(316).: Failure during collective
>>
>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
>> MPI_Allreduce(1421)......: MPI_Allreduce(sbuf=0x7fff2e54068c,
>> rbuf=0x31e35a0, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000001) failed
>>
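
For context, a GROMACS 5.1 temperature-REMD launch of this kind typically looks
something like the following (binary name, replica count, directory layout and
exchange interval are illustrative, not taken from the report):

    # one directory per temperature replica, each containing its own topol.tpr;
    # 64 replicas and an exchange attempt every 500 steps are assumed here
    mpirun -np 64 gmx_mpi mdrun -multidir replica_{00..63} -replex 500
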
>>
>> Slightly older versions compiled recently, such as 5.0.6, do not show
>> this behavior. I have tried updating to the latest cmake, compiler, and
>> MPI versions on our system, but that does not change anything.
>> Does anyone have suggestions on how to fix this?
>>
>> Thanks
>> Krzysztof
>>
>> --
>> Krzysztof Kuczera
>> Departments of Chemistry and Molecular Biosciences
>> The University of Kansas
>> 1251 Wescoe Hall Drive, 5090 Malott Hall
>> Lawrence, KS 66045
>> Tel: 785-864-5060 Fax: 785-864-5396 email: kkuczera at ku.edu
>> http://oolung.chem.ku.edu/~kuczera/home.html
>>

-- 
==================================================

Justin A. Lemkul, Ph.D.
Ruth L. Kirschstein NRSA Postdoctoral Fellow

Department of Pharmaceutical Sciences
School of Pharmacy
Health Sciences Facility II, Room 629
University of Maryland, Baltimore
20 Penn St.
Baltimore, MD 21201

jalemkul at outerbanks.umaryland.edu | (410) 706-7441
http://mackerell.umaryland.edu/~jalemkul

==================================================

