[gmx-users] Gromacs 5.1 and 5.1.1 crash in REMD

Mark Abraham mark.j.abraham at gmail.com
Fri Nov 20 21:52:29 CET 2015


Hi,

Yes, that's a real bug. I'm not yet sure what to do about it, but I'll
continue the discussion at http://redmine.gromacs.org/issues/1848

Mark

On Thu, Nov 19, 2015 at 8:06 PM Krzysztof Kuczera <kkuczera at ku.edu> wrote:

> Dear Justin and Mark
>
> Thanks for your helpful suggestions.
> Yes, my case is just like bug 1848
> Our computing staff recompiled the 5.1.1 code with debugging enabled and
> ran a backtrace on my job, concluding that there is a bug in the code.
> I include their conclusions in case they help resolve the problem; a
> condensed backtrace follows:
> Krzysztof
>
>
> [67] 0x00000000007a10bd in add_binr (b=0x25f11c0, nr=9, r=0x0) at
> /home/wmason/gromacs-5.1.1/src/gromacs/gmxlib/rbin.c:94
> [4-7,12-15,20-31,60-67] 94 rbuf[i] = r[i];
>
> 0x0000000000725758 in global_stat (fplog=0x3543750, gs=0x3639420,
> cr=0x3536fa0,
> enerd=0x36398b0, fvir=0x0, svir=0x0, mu_tot=0x7fff6a2dfe6c,
> inputrec=0x35424f0,
> ekind=0x3635990, constr=0x363b920, vcm=0x0, nsig=0, sig=0x0,
> top_global=0x3541860, state_local=0x363a090, bSumEkinhOld=0, flags=146) at
> /home/wmason/gromacs-5.1.1/src/gromacs/mdlib/stat.cpp:229
>
> 0x000000000073efcd in compute_globals (fplog=0x3543750, gstat=0x3639420,
> cr=0x3536fa0, ir=0x35424f0, fr=0x3599df0, ekind=0x3635990, state=0x363a090,
> state_global=0x3543270, mdatoms=0x35cc860, nrnb=0x3599a20, vcm=0x3623b40,
> wcycle=0x3599340, enerd=0x36398b0, force_vir=0x0, shake_vir=0x0,
> total_vir=0x0,
> pres=0x0, mu_tot=0x7fff6a2dfe6c, constr=0x363b920, gs=0x0, bInterSimGS=0,
> box=0x363a0b0, top_global=0x3541860, bSumEkinhOld=0x7fff6a2dff10,
> flags=146) at
> /home/wmason/gromacs-5.1.1/src/gromacs/mdlib/md_support.cpp:342
>
> 0x00000000004c3dca in do_md (fplog=0x0, cr=0x24a7fb0, nfile=35,
> fnm=0x7fffa95b48a8, oenv=0x24b46e0, bVerbose=0, bCompact=1,
> nstglobalcomm=20,
> vsite=0x252d890, constr=0x25e9a80, stepout=100, ir=0x24b2430,
> top_global=0x24b4760, fcd=0x24ea8b0, state_global=0x24b31c0,
> mdatoms=0x252d980,
> nrnb=0x24fa9b0, wcycle=0x24fa1b0, ed=0x0, fr=0x24fad80, repl_ex_nst=500,
> repl_ex_nex=0, repl_ex_seed=-1, membed=0x0, cpt_period=15, max_hours=-1,
> imdport=8888, Flags=1055744, walltime_accounting=0x2584a20) at
> /home/wmason/gromacs-5.1.1/src/programs/mdrun/md.cpp:969
>
> 0x00000000004d4a64 in mdrunner (hw_opt=0x7fff6a2e1d58, fplog=0x3543750,
> cr=0x3536fa0, nfile=35, fnm=0x7fff6a2e15f8, oenv=0x35436d0, bVerbose=0,
> bCompact=1, nstglobalcomm=-1, ddxyz=0x7fff6a2e11bc, dd_node_order=1, rdd=0,
> rconstr=0, dddlb_opt=0x1f3b10c "auto", dlb_scale=0.800000012, ddcsx=0x0,
> ddcsy=0x0, ddcsz=0x0, nbpu_opt=0x1f3b10c "auto", nstlist_cmdline=0,
> nsteps_cmdline=-2, nstepout=100, resetstep=-1, nmultisim=40,
> repl_ex_nst=500,
> repl_ex_nex=0, repl_ex_seed=-1, pforce=-1, cpt_period=15, max_hours=-1,
> imdport=8888, Flags=1055744) at
> /home/wmason/gromacs-5.1.1/src/programs/mdrun/runner.cpp:1270
>
> 0x00000000004cb637 in gmx_mdrun (argc=15, argv=0x3531c20) at
> /home/wmason/gromacs-5.1.1/src/programs/mdrun/mdrun.cpp:537
>
> 0x000000000050b26b in gmx::CommandLineModuleManager::runAsMainCMain
> (argc=15,
> argv=0x7fffa95b5aa8, mainFunction=0x4c8d73 <gmx_mdrun(int, char**)>) at
>
> /home/wmason/gromacs-5.1.1/src/gromacs/commandline/cmdlinemodulemanager.cpp:588
>
> 0x00000000004ba316 in main (argc=15, argv=0x7fff6a2e27f8) at
> /home/wmason/gromacs-5.1.1/src/programs/mdrun_main.cpp:43
>
>
> The error is a classic segmentation fault, caused by accessing an array
> out of bounds. It's a bug in the GROMACS 5.1.1 code. You will need to
> file a bug report with GROMACS; they will need your job input to
> reproduce the error, along with the backtrace info above, which I've
> summarized below in case you want to read the code yourself:
>
> "add_rbin" in gromacs-5.1.1/src/gromacs/gmxlib/rbin.c:94
> called from "global_stat" in gromacs-5.1.1/src/gromacs/mdlib/stat.cpp:229
> called from "compute_globals" in
> gromacs-5.1.1/src/gromacs/mdlib/md_support.cpp:342
> called from "do_md" in /gromacs-5.1.1/src/programs/mdrun/md.cpp:969
> called from "mdrunner" in gromacs-5.1.1/src/programs/mdrun/runner.cpp:1270
> called from "gmx_mdrun" in gromacs-5.1.1/src/programs/mdrun/mdrun.cpp:537
>
>
> The code that fails takes a "bin" of the "gmx_global_stat" type, which
> holds an array of doubles and the size of that array, and tries to copy
> into it data from a "tensor" holding the force virial, during a step in
> which energy is computed globally.
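>
> A minimal self-contained sketch of that pattern follows (the names are
> invented for illustration, not copied from the GROMACS source): the bin
> grows a buffer of doubles and copies the caller's reals into it, so a
> NULL source pointer dies exactly at the copy loop shown in the trace.
>
>     #include <stdlib.h>
>
>     typedef float real;              /* GROMACS commonly builds with real = float */
>
>     typedef struct {
>         double *rbuf;                /* reduction buffer                  */
>         int     nreal;               /* number of values currently stored */
>         int     maxreal;             /* allocated capacity                */
>     } t_bin_sketch;
>
>     static int add_binr_sketch(t_bin_sketch *b, int nr, const real r[])
>     {
>         int index = b->nreal;
>
>         if (index + nr > b->maxreal)
>         {
>             /* grow the buffer; a quietly failed realloc here would
>              * match cause A described below                          */
>             b->maxreal = index + nr;
>             b->rbuf    = realloc(b->rbuf, b->maxreal * sizeof(double));
>         }
>         for (int i = 0; i < nr; i++)
>         {
>             b->rbuf[index + i] = r[i];  /* segfaults here when r == NULL */
>         }
>         b->nreal += nr;
>         return index;                   /* offset of this block in the bin */
>     }
>
>     int main(void)
>     {
>         t_bin_sketch b      = { NULL, 0, 0 };
>         real         vir[9] = { 0 };    /* a 3x3 virial tensor, flattened */
>
>         add_binr_sketch(&b, 9, vir);    /* fine: r points at 9 valid reals */
>         /* add_binr_sketch(&b, 9, NULL);  would crash like the trace above */
>         free(b.rbuf);
>         return 0;
>     }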
>
> I don't have any clue what this means beyond what the low-level code says
> (I almost always read the source code when debugging, so I can try to
> explain what's happening). There are multiple possible causes for this
> error, such as:
> A. The memory allocation for the "bin" fails quietly: the array is not
> resized (or not to the right size), and then the error occurs at the next
> function that tries to write data there.
> B. The tensor is actually not the size "DIM*DIM" (3x3, if I'm reading
> correctly) that the function expects. Accessing the source tensor array
> out of bounds also generates this error.
>
> C. The tensor is actually a NULL pointer. This is the most likely
> explanation, which one can see from the backtrace line
>     add_binr (b=0x25f11c0, nr=9, r=0x0)
> where either r=NULL or the debugger is not reporting the value correctly.
>
> This would mean a higher-level routine is passing bad parameters without
> the callee checking them; in the backtrace, "do_md" calls "compute_globals"
> with force_vir=0x0 and shake_vir=0x0, and those NULLs reach the copy loop.
> The error actually originates at a higher level of the code, rather than
> at the low level where it is reported.
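>
> A hypothetical illustration (invented names, not the actual GROMACS
> patch) of the kind of caller-side check this points to: test the virial
> pointer before handing it down, so a step without a virial is skipped
> instead of segfaulting deep inside rbin.c.
>
>     #include <stdio.h>
>
>     #define DIM 3
>     typedef float real;
>     typedef real  tensor[DIM][DIM];
>
>     /* stand-in for the reduction chain (global_stat -> add_binr) */
>     static void reduce_nine_reals(const real r[DIM * DIM])
>     {
>         for (int i = 0; i < DIM * DIM; i++)
>         {
>             printf("%g ", r[i]);    /* the real code copies into the bin */
>         }
>         printf("\n");
>     }
>
>     /* Guarded caller: ignore a missing virial instead of letting the
>      * NULL pointer reach the low-level copy loop.                    */
>     static void sum_virial_guarded(tensor *fvir)
>     {
>         if (fvir == NULL)
>         {
>             return;                 /* nothing supplied on this step */
>         }
>         reduce_nine_reals(&(*fvir)[0][0]);
>     }
>
>     int main(void)
>     {
>         tensor vir = { { 0 } };
>
>         sum_virial_guarded(&vir);   /* normal step: virial present   */
>         sum_virial_guarded(NULL);   /* the fvir=0x0 case: no crash   */
>         return 0;
>     }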
>
>
>
>
>
> On 11/17/15 3:20 PM, Justin Lemkul wrote:
> >
> >
> > On 11/17/15 3:00 PM, Mark Abraham wrote:
> >> Hi,
> >>
> >> That is indeed strange. MPI_Allreduce isn't used in replica exchange,
> >> nor
> >> did the replica-exchange code change between 5.0.6 and 5.1, so the
> >> problem
> >> is elsewhere. You could try running with the environment variable
> >> GMX_CYCLE_BARRIER set to 1 (which might require you to tell mpirun
> >> that's
> >> what you want) so that we can localize which MPI_Allreduce is losing a
> >> process. Or any other way you might have available to get a stack trace
> >> from each process.
> >>
> >
> > Maybe related to this?
> >
> > http://redmine.gromacs.org/issues/1848
> >
> > -Justin
> >
> >> Mark
> >>
> >> On Tue, Nov 17, 2015 at 6:11 PM Krzysztof Kuczera <kkuczera at ku.edu>
> >> wrote:
> >>
> >>> Hi
> >>> I am trying to run a temperature-exchange REMD simulation with GROMACS
> >>> 5.1 or 5.1.1
> >>> and my job is crashing in a way that is difficult to explain:
> >>> - the MD part works fine
> >>> - the crash occurs at the first replica-exchange attempt
> >>> - the error log contains a bunch of messages of the type below, which I
> >>> suppose mean that the MPI communication did not work
> >>>
> >>> NOTE: Turning on dynamic load balancing
> >>> Fatal error in MPI_Allreduce: A process has failed, error stack:
> >>> MPI_Allreduce(1421).......: MPI_Allreduce(sbuf=0x7fff5538018c,
> >>>   rbuf=0x28b2070, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000002) failed
> >>> MPIR_Allreduce_impl(1262).:
> >>> MPIR_Allreduce_intra(497).:
> >>> MPIR_Bcast_binomial(245)..:
> >>> dequeue_and_set_error(917): Communication error with rank 48
> >>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> >>> MPI_Allreduce(1421)......: MPI_Allreduce(sbuf=0x7fff31eb660c,
> >>>   rbuf=0x2852c00, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000001) failed
> >>> MPIR_Allreduce_impl(1262):
> >>> MPIR_Allreduce_intra(497):
> >>> MPIR_Bcast_binomial(316).: Failure during collective
> >>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> >>> MPI_Allreduce(1421)......: MPI_Allreduce(sbuf=0x7fff2e54068c,
> >>>   rbuf=0x31e35a0, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000001) failed
> >>>
> >>>
> >>> Recently compiled, slightly older versions like 5.0.6 do not show this
> >>> behavior.
> >>> I have tried updating to the latest cmake, compiler, and MPI versions
> >>> on our system, but it does not change things.
> >>> Does anyone have suggestions on how to fix this?
> >>>
> >>> Thanks
> >>> Krzysztof
> >>>
> >>> --
> >>> Krzysztof Kuczera
> >>> Departments of Chemistry and Molecular Biosciences
> >>> The University of Kansas
> >>> 1251 Wescoe Hall Drive, 5090 Malott Hall
> >>> Lawrence, KS 66045
> >>> Tel: 785-864-5060 Fax: 785-864-5396 email: kkuczera at ku.edu
> >>> http://oolung.chem.ku.edu/~kuczera/home.html
> >>>
> >
>
>
> --
> Krzysztof Kuczera
> Departments of Chemistry and Molecular Biosciences
> The University of Kansas
> 1251 Wescoe Hall Drive, 5090 Malott Hall
> Lawrence, KS 66045
> Tel: 785-864-5060 Fax: 785-864-5396 email: kkuczera at ku.edu
> http://oolung.chem.ku.edu/~kuczera/home.html
>