[gmx-users] Gromacs 5.1 and 5.1.1 crash in REMD

Krzysztof Kuczera kkuczera at ku.edu
Thu Nov 19 20:05:42 CET 2015


Dear Justin and Mark,

Thanks for your helpful suggestions.
Yes, my case is just like bug 1848
Our computing staff recompiled the 5.1.1 code with the debugger and ran a
backtrace on my job, concluding that there is a bug in the code.
I include their conclusions below in case they might help resolve the problem;
a condensed backtrace follows:
Krzysztof


[67] 0x00000000007a10bd in add_binr (b=0x25f11c0, nr=9, r=0x0) at
/home/wmason/gromacs-5.1.1/src/gromacs/gmxlib/rbin.c:94
[4-7,12-15,20-31,60-67] 94 rbuf[i] = r[i];

0x0000000000725758 in global_stat (fplog=0x3543750, gs=0x3639420, cr=0x3536fa0,
enerd=0x36398b0, fvir=0x0, svir=0x0, mu_tot=0x7fff6a2dfe6c, inputrec=0x35424f0,
ekind=0x3635990, constr=0x363b920, vcm=0x0, nsig=0, sig=0x0,
top_global=0x3541860, state_local=0x363a090, bSumEkinhOld=0, flags=146) at
/home/wmason/gromacs-5.1.1/src/gromacs/mdlib/stat.cpp:229

0x000000000073efcd in compute_globals (fplog=0x3543750, gstat=0x3639420,
cr=0x3536fa0, ir=0x35424f0, fr=0x3599df0, ekind=0x3635990, state=0x363a090,
state_global=0x3543270, mdatoms=0x35cc860, nrnb=0x3599a20, vcm=0x3623b40,
wcycle=0x3599340, enerd=0x36398b0, force_vir=0x0, shake_vir=0x0, total_vir=0x0,
pres=0x0, mu_tot=0x7fff6a2dfe6c, constr=0x363b920, gs=0x0, bInterSimGS=0,
box=0x363a0b0, top_global=0x3541860, bSumEkinhOld=0x7fff6a2dff10, flags=146) at
/home/wmason/gromacs-5.1.1/src/gromacs/mdlib/md_support.cpp:342

0x00000000004c3dca in do_md (fplog=0x0, cr=0x24a7fb0, nfile=35,
fnm=0x7fffa95b48a8, oenv=0x24b46e0, bVerbose=0, bCompact=1, nstglobalcomm=20,
vsite=0x252d890, constr=0x25e9a80, stepout=100, ir=0x24b2430,
top_global=0x24b4760, fcd=0x24ea8b0, state_global=0x24b31c0, mdatoms=0x252d980,
nrnb=0x24fa9b0, wcycle=0x24fa1b0, ed=0x0, fr=0x24fad80, repl_ex_nst=500,
repl_ex_nex=0, repl_ex_seed=-1, membed=0x0, cpt_period=15, max_hours=-1,
imdport=8888, Flags=1055744, walltime_accounting=0x2584a20) at
/home/wmason/gromacs-5.1.1/src/programs/mdrun/md.cpp:969

0x00000000004d4a64 in mdrunner (hw_opt=0x7fff6a2e1d58, fplog=0x3543750,
cr=0x3536fa0, nfile=35, fnm=0x7fff6a2e15f8, oenv=0x35436d0, bVerbose=0,
bCompact=1, nstglobalcomm=-1, ddxyz=0x7fff6a2e11bc, dd_node_order=1, rdd=0,
rconstr=0, dddlb_opt=0x1f3b10c "auto", dlb_scale=0.800000012, ddcsx=0x0,
ddcsy=0x0, ddcsz=0x0, nbpu_opt=0x1f3b10c "auto", nstlist_cmdline=0,
nsteps_cmdline=-2, nstepout=100, resetstep=-1, nmultisim=40, repl_ex_nst=500,
repl_ex_nex=0, repl_ex_seed=-1, pforce=-1, cpt_period=15, max_hours=-1,
imdport=8888, Flags=1055744) at
/home/wmason/gromacs-5.1.1/src/programs/mdrun/runner.cpp:1270

0x00000000004cb637 in gmx_mdrun (argc=15, argv=0x3531c20) at
/home/wmason/gromacs-5.1.1/src/programs/mdrun/mdrun.cpp:537

0x000000000050b26b in gmx::CommandLineModuleManager::runAsMainCMain (argc=15,
argv=0x7fffa95b5aa8, mainFunction=0x4c8d73 <gmx_mdrun(int, char**)>) at
/home/wmason/gromacs-5.1.1/src/gromacs/commandline/cmdlinemodulemanager.cpp:588

0x00000000004ba316 in main (argc=15, argv=0x7fff6a2e27f8) at
/home/wmason/gromacs-5.1.1/src/programs/mdrun_main.cpp:43


The error is a classic segmentation fault, caused by accessing an array out of
bounds. It is a bug in the GROMACS 5.1.1 code. You will need to file a bug
report with GROMACS; they will need your job input to reproduce the error, along
with the backtrace info above, which I have summarized here in case you want to
read the code yourself:

"add_rbin" in gromacs-5.1.1/src/gromacs/gmxlib/rbin.c:94
called from "global_stat" in gromacs-5.1.1/src/gromacs/mdlib/stat.cpp:229
called from "compute_globals" in
gromacs-5.1.1/src/gromacs/mdlib/md_support.cpp:342
called from "do_md" in /gromacs-5.1.1/src/programs/mdrun/md.cpp:969
called from "mdrunner" in gromacs-5.1.1/src/programs/mdrun/runner.cpp:1270
called from "gmx_mdrun" in gromacs-5.1.1/src/programs/mdrun/mdrun.cpp:537


The code that fails takes a "bin" of the "gmx_global_stat" type, which holds an
array of doubles and the size of that array, and tries to copy into it data from
a "tensor" holding the force virial, during a step in which energies are
computed globally (a simplified sketch of that copy follows below).
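
To make the failing operation concrete, here is a minimal sketch of that copy,
with simplified, assumed types for illustration (t_bin_sketch and
add_binr_sketch are hypothetical names; the real t_bin in
src/gromacs/gmxlib/rbin.c also handles reallocation of its buffer):

    typedef float real;             /* assumption: single-precision build      */

    typedef struct {
        int     nreal;              /* number of values currently stored       */
        int     maxreal;            /* allocated size of rbuf (realloc omitted)*/
        double *rbuf;               /* accumulation buffer of doubles          */
    } t_bin_sketch;

    /* Copy nr reals from r into the bin and return the offset at which they
     * were stored.  Dereferences r unconditionally, so it segfaults when r is
     * NULL and nr > 0, which is exactly what the backtrace shows
     * (nr=9, r=0x0). */
    int add_binr_sketch(t_bin_sketch *b, int nr, const real r[])
    {
        int     i, index = b->nreal;
        double *rbuf     = b->rbuf + index;

        for (i = 0; i < nr; i++)
        {
            rbuf[i] = r[i];         /* corresponds to rbin.c:94 in the trace   */
        }
        b->nreal += nr;
        return index;
    }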

I don't have any clue what this means, beyond what the low-level code says
(when debugging I almost always read the source code, so I can at least try to
explain what is happening). There are several possible causes for this error,
such as:
A. The memory allocation for the "bin" fails quietly: the array is not resized
(or is not the right size), and the error then occurs in the next function that
tries to write data there.
B. The tensor is actually not the size "DIM*DIM" (3x3, if I'm reading correctly)
that the function expects. Accessing the source tensor array out of bounds would
also generate this error.

C. The tensor is actually a NULL pointer. This is the most likely explanation,
as one can see from the backtrace line:
add_binr (b=0x25f11c0, nr=9, r=0x0)
^---- Either r=NULL or the debugger is not reporting the value correctly.

This would mean the program is passing bad (NULL) parameters down the call chain
without checking them: in the backtrace above, "do_md" already calls
"compute_globals" with force_vir=0x0 and shake_vir=0x0, and "global_stat"
receives fvir=0x0 and svir=0x0. The error actually originates at a higher level
of the code than the low level at which it is reported.
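
For illustration only, here is a minimal sketch (not the actual GROMACS code,
and not a proposed fix for the underlying problem) of the kind of guard at the
low level that would turn this silent NULL propagation into an immediate,
readable error; the function copy_reals_checked and its signature are
hypothetical:

    #include <stdio.h>
    #include <stdlib.h>

    typedef float real;             /* assumption: single-precision build */

    /* Copy nr source values into the reduction buffer, but refuse to
     * dereference a NULL source array with a non-zero count, which is the
     * situation the backtrace shows (nr=9, r=0x0). */
    void copy_reals_checked(double *dest, int nr, const real *src)
    {
        if (nr > 0 && src == NULL)
        {
            fprintf(stderr,
                    "Internal error: asked to reduce %d values from a NULL array\n",
                    nr);
            abort();
        }
        for (int i = 0; i < nr; i++)
        {
            dest[i] = src[i];
        }
    }

The real fix, of course, is for the higher-level caller not to pass NULL when
the flags request the virial sums, but a check like this would at least point
straight at the offending call instead of producing a bare segmentation fault.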





On 11/17/15 3:20 PM, Justin Lemkul wrote:
>
>
> On 11/17/15 3:00 PM, Mark Abraham wrote:
>> Hi,
>>
>> That is indeed strange. MPI_Allreduce isn't used in replica exchange, 
>> nor
>> did the replica-exchange code change between 5.0.6 and 5.1, so the 
>> problem
>> is elsewhere. You could try running with the environment variable
>> GMX_CYCLE_BARRIER set to 1 (which might require you to tell mpirun 
>> that's
>> what you want) so that we can localize which MPI_Allreduce is losing a
>> process. Or any other way you might have available to get a stack trace
>> from each process.
>>
>
> Maybe related to this?
>
> http://redmine.gromacs.org/issues/1848
>
> -Justin
>
>> Mark
>>
>> On Tue, Nov 17, 2015 at 6:11 PM Krzysztof Kuczera <kkuczera at ku.edu> 
>> wrote:
>>
>>> Hi
>>> I am trying to run a temperature-exchange REMD simulation with GROMACS
>>> 5.1 or 5.1.1, and my job is crashing in a way that is difficult to explain:
>>> - the MD part works fine
>>> - the crash occurs at the first replica-exchange attempt
>>> - the error log contains a bunch of messages of the following type, which I
>>>   suppose mean that the MPI communication did not work:
>>>
>>> NOTE: Turning on dynamic load balancing
>>>
>>> Fatal error in MPI_Allreduce: A process has failed, error stack:
>>> MPI_Allreduce(1421).......: MPI_Allreduce(sbuf=0x7fff5538018c,
>>> rbuf=0x28b2070, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000002) failed
>>> MPIR_Allreduce_impl(1262).:
>>> MPIR_Allreduce_intra(497).:
>>> MPIR_Bcast_binomial(245)..:
>>> dequeue_and_set_error(917): Communication error with rank 48
>>>
>>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>> MPI_Allreduce(1421)......: MPI_Allreduce(sbuf=0x7fff31eb660c,
>>> rbuf=0x2852c00, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000001) failed
>>> MPIR_Allreduce_impl(1262):
>>> MPIR_Allreduce_intra(497):
>>> MPIR_Bcast_binomial(316).: Failure during collective
>>>
>>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>> MPI_Allreduce(1421)......: MPI_Allreduce(sbuf=0x7fff2e54068c,
>>> rbuf=0x31e35a0, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000001) failed
>>>
>>>
>>> Recently compiled, slightly older versions like 5.0.6 do not show this
>>> behavior. I have tried updating to the latest cmake, compiler, and MPI
>>> versions on our system, but it does not change things.
>>> Does anyone have suggestions on how to fix this?
>>>
>>> Thanks
>>> Krzysztof
>>>
>>> -- 
>>> Krzysztof Kuczera
>>> Departments of Chemistry and Molecular Biosciences
>>> The University of Kansas
>>> 1251 Wescoe Hall Drive, 5090 Malott Hall
>>> Lawrence, KS 66045
>>> Tel: 785-864-5060 Fax: 785-864-5396 email: kkuczera at ku.edu
>>> http://oolung.chem.ku.edu/~kuczera/home.html
>>>
>


-- 
Krzysztof Kuczera
Departments of Chemistry and Molecular Biosciences
The University of Kansas
1251 Wescoe Hall Drive, 5090 Malott Hall
Lawrence, KS 66045
Tel: 785-864-5060 Fax: 785-864-5396 email: kkuczera at ku.edu
http://oolung.chem.ku.edu/~kuczera/home.html


