[gmx-developers] Lost particles while sorting

Carsten Kutzner ckutzne at gwdg.de
Fri Nov 8 14:58:18 CET 2013


Hi Berk,

On 11/08/2013 02:48 PM, Berk Hess wrote:
> Hi,
>
> I assume this is with GPUs.
> If you run in a debugger, break on exit, can you tell me which 
> sort_atoms call this comes from?
>
It comes from sort_columns_supersub:


Breakpoint 1, 0x00007ffff5d17920 in exit () from /lib64/libc.so.6
(gdb) where
#0  0x00007ffff5d17920 in exit () from /lib64/libc.so.6
#1  0x0000000000a60558 in quit_gmx (msg=0x7fffee092290 "\n", '-' 
<repeats 55 times>, "\nProgram mdrun, VERSION 
4.6.4-dev-20131107-ba8232e\nSource code file: 
/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c, l"...) 
at 
/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:284
#2  0x0000000000a61609 in _gmx_error (key=0x185ecde "fatal", 
msg=0x7fffee094af0 "(int)((x[74522][x]=11.764535 - 10.229600)*58.394176) 
= 89, not in 0 - 16*4\n", file=0x1836ad8 
"/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c", 
line=609) at 
/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:774
#3  0x0000000000a61054 in gmx_fatal (f_errno=0, file=0x1836ad8 
"/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c", 
line=609, fmt=0x1836c80 "(int)((x[%d][%c]=%f - %f)*%f) = %d, not in 0 - 
%d*%d\n") at 
/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:509
#4  0x00000000004e91db in sort_atoms (dim=0, Backwards=0, 
a=0x7ffff212e920, n=2, x=0x7ffff148a650, h0=10.2296, invh=58.3941765, 
n_per_h=16, sort=0x7ffff10bb2b0) at 
/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:609
#5  0x00000000004ebcee in sort_columns_supersub (nbs=0x7ffff057f2a0, 
dd_zone=1, grid=0x7ffff06d6f60, a0=61048, a1=74523, 
atinfo=0x7ffff0c51840, x=0x7ffff148a650, nbat=0x7ffff05d3bb0, 
cxy_start=10, cxy_end=15, sort_work=0x7ffff10bb2b0) at 
/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1394
#6  0x00000000004ecf09 in calc_cell_indices.omp_fn.1 
(.omp_data_i=0x7ffff516f670) at 
/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1651
#7  0x00007ffff62a50ba in ?? () from /usr/lib64/libgomp.so.1
#8  0x00007ffff796de0e in start_thread () from /lib64/libpthread.so.0
#9  0x00007ffff5dc42cd in clone () from /lib64/libc.so.6
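
To make the numbers in frames #2 and #4 easier to follow, here is a minimal sketch in Python (not the GROMACS source) that redoes the arithmetic from the error message. The helper names sort_bin and in_range are made up for illustration, and treating the upper bound 16*4 as inclusive is an assumption; the message only prints the range as "0 - 16*4".

```python
# Values copied from the fatal error message and gdb frame #4 above.
x = 11.764535      # x coordinate of atom 74522
h0 = 10.2296       # h0 argument of sort_atoms
invh = 58.394176   # invh argument of sort_atoms, as printed in the message
n_per_h = 16       # n_per_h argument of sort_atoms
oversize = 4       # the "*4" factor printed in the message


def sort_bin(x, h0, invh):
    """Bin index as written in the message: (int)((x - h0)*invh)."""
    # int() truncates toward zero, like the C cast in the message.
    return int((x - h0) * invh)


def in_range(zi, n_per_h, oversize):
    """True if zi lies in 0 .. n_per_h*oversize (inclusive bound assumed)."""
    return 0 <= zi <= n_per_h * oversize


zi = sort_bin(x, h0, invh)
print(f"bin = {zi}, allowed 0 - {n_per_h * oversize}, "
      f"in range = {in_range(zi, n_per_h, oversize)}")
```

With these values the index comes out as 89, well past 16*4 = 64, which is the condition the debug check reports for atom 74522.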

> On how many MPI ranks is this?
> If I can easily run this, could you mail me the tpr and the run settings?
mdrun_threads -ntmpi 2 -s in.tpr -v -gpu_id 00

Best,
Carsten

>
> Cheers,
>
> Berk
>
> On 11/08/2013 02:30 PM, Carsten Kutzner wrote:
>> Hi,
>>
>> using a just checked-out 4.6 branch compiled with debug checks I get
>>
>> -------------------------------------------------------
>> Program mdrun, VERSION 4.6.4-dev-20131107-ba8232e
>> Source code file: 
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c, 
>> line: 609
>>
>> Fatal error:
>> (int)((x[74522][x]=11.764535 - 10.229600)*58.394176) = 89, not in 0 - 
>> 16*4
>>
>> For more information and tips for troubleshooting, please check the 
>> GROMACS
>> website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>>
>> Carsten
>>
>>
>> On 11/08/2013 02:00 PM, Berk Hess wrote:
>>> On 11/08/2013 01:44 PM, Mark Abraham wrote:
>>>>
>>>>
>>>>
>>>> On Fri, Nov 8, 2013 at 12:58 PM, Carsten Kutzner <ckutzne at gwdg.de>
>>>> wrote:
>>>>
>>>>     Hi Mark, hi Berk,
>>>>
>>>>     On Nov 7, 2013, at 6:48 PM, Berk Hess <hess at kth.se> wrote:
>>>>
>>>>     > Hi Carsten,
>>>>     >
>>>>     > After how many steps does this happen?
>>>>     This happens immediately at startup.
>>>>
>>>>     > Could you run with a debug build (or without NDEBUG defined)?
>>>>     > I added a lot of checks, not done with NDEBUG, in the fix for
>>>>     > the issue you linked.
>>>>     Will do that now.
>>>>
>>>>     > On 11/07/2013 06:27 PM, Mark Abraham wrote:
>>>>     >> Unclear. 6583c94 is one of your commits. Some very recent
>>>>     >> stuff has been playing with nstlist and rlist (safely, or so we
>>>>     >> thought.) Can you reproduce with mainstream release-4-6?
>>>>     This is basically mainstream 4-6, since in my commit I only
>>>>     changed the default behavior of appending to no.
>>>>
>>>>
>>>> Right. What's the mainstream parent commit? I was going to release 
>>>> 4.6.4 today - if you're based off the current tip then maybe we 
>>>> shouldn't. If you're based off code a month back then we know the 
>>>> problem, if any, is of longer standing.
>>> This is 4.6.4-dev which seems to include my fix for the previous 
>>> issue, so this issue is surely present in the current 4-6-release 
>>> branch. It must be due to a somewhat exotic condition, since this 
>>> code is widely used and we haven't had other reports.
>>>
>>> I think it should be easy to track this down with all the debug 
>>> checks in the code.
>>> And if Carsten can send me his system and the conditions to 
>>> reproduce it, I can also help with debugging.
>>>
>>> Cheers,
>>>
>>> Berk
>>>>
>>>> Mark
>>>>
>>>>
>>>>     Carsten
>>>>
>>>>     >>
>>>>     >> Mark
>>>>     >>
>>>>     >>
>>>>     >> On Thu, Nov 7, 2013 at 5:18 PM, Carsten Kutzner
>>>>     >> <ckutzne at gwdg.de> wrote:
>>>>     >> Hi,
>>>>     >>
>>>>     >> we have a 120k atom system that crashes with
>>>>     >>
>>>>     >> ------------------------------------------------------
>>>>     >> Program mdrun_mpi, VERSION 4.6.4-dev-20131015-6583c94
>>>>     >> Source code file: /home/c/gromacs/src/mdlib/nbnxn_search.c,
>>>>     >> line: 685
>>>>     >>
>>>>     >> Software inconsistency error:
>>>>     >> Lost particles while sorting
>>>>     >> For more information and tips for troubleshooting, please
>>>>     >> check the GROMACS
>>>>     >> website at http://www.gromacs.org/Documentation/Errors
>>>>     >> -------------------------------------------------------
>>>>     >>
>>>>     >> if run with >= 2 MPI processes on a GPU and small values for
>>>>     >> nstlist. On my workstation, nstlist = 34 and larger works,
>>>>     >> whereas nstlist <= 33 leads to the above problem.
>>>>     >>
>>>>     >> Another system (60k atoms) does not produce this problem, so
>>>>     >> system size seems to matter as well.
>>>>     >>
>>>>     >> Looks like an old ghost:
>>>>     >>
>>>>     >> http://redmine.gromacs.org/issues/1153
>>>>     >>
>>>>     >>
>>>>     >> Should I file a redmine issue?
>>>>     >>
>>>>     >> Carsten
>>>>     >>
>>>>     >>
>>>>     >> --
>>>>     >> gmx-developers mailing list
>>>>     >> gmx-developers at gromacs.org <mailto:gmx-developers at gromacs.org>
>>>>     >> http://lists.gromacs.org/mailman/listinfo/gmx-developers
>>>>     >> Please don't post (un)subscribe requests to the list. Use
>>>>     the www interface or send it to
>>>>     gmx-developers-request at gromacs.org
>>>>     <mailto:gmx-developers-request at gromacs.org>.
>>>>     >>
>>>>     >>
>>>>     >>
>>>>     >
