[gmx-developers] Lost particles while sorting
Berk Hess
hess at kth.se
Fri Nov 8 15:39:31 CET 2013
I think I can upload a fix around 18:00.
Cheers,
Berk
On 11/08/2013 03:37 PM, Mark Abraham wrote:
> OK, thanks Berk. Szilard and I each have thought of some minor
> documentation things that would be nice to fix also.
>
> I suggest we try to get all that in so we can shift focus away from
> release-4-6, with the 5.0 beta only weeks away now!!
>
> Mark
>
>
> On Fri, Nov 8, 2013 at 3:30 PM, Berk Hess <hess at kth.se
> <mailto:hess at kth.se>> wrote:
>
> Hi,
>
> I think I found the source of the problem, now I have to come up
> with a solution.
> It seems you have a harmonic potential between two atoms at a
> distance of 2.9 nm. This is much longer than the pair-list range.
> So one (non-local) atom ends up beyond the non-local search grid.
> I thought I had accounted for such cases, but apparently not.
> We can probably simply round down the index to the maximum, but
> that means we can't check for other cases, such as without DD,
> where atoms can't be beyond the grid.
> The fix will be very simple and quick, but I need to think it over
> a bit more deeply.
>
> Cheers,
>
> Berk
>
>
> On 11/08/2013 02:58 PM, Carsten Kutzner wrote:
>> Hi Berk,
>>
>> On 11/08/2013 02:48 PM, Berk Hess wrote:
>>> Hi,
>>>
>>> I assume this is with GPUs.
>>> If you run in a debugger, break on exit, can you tell me which
>>> sort_atoms call this comes from?
>>>
>> It comes from sort_columns_supersub:
>>
>>
>> Breakpoint 1, 0x00007ffff5d17920 in exit () from /lib64/libc.so.6
>> (gdb) where
>> #0 0x00007ffff5d17920 in exit () from /lib64/libc.so.6
>> #1 0x0000000000a60558 in quit_gmx (msg=0x7fffee092290 "\n", '-'
>> <repeats 55 times>, "\nProgram mdrun, VERSION
>> 4.6.4-dev-20131107-ba8232e\nSource code file:
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c,
>> l"...) at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:284
>> #2 0x0000000000a61609 in _gmx_error (key=0x185ecde "fatal",
>> msg=0x7fffee094af0 "(int)((x[74522][x]=11.764535 -
>> 10.229600)*58.394176) = 89, not in 0 - 16*4\n", file=0x1836ad8
>> "/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c",
>> line=609) at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:774
>> #3 0x0000000000a61054 in gmx_fatal (f_errno=0, file=0x1836ad8
>> "/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c",
>> line=609, fmt=0x1836c80 "(int)((x[%d][%c]=%f - %f)*%f) = %d, not
>> in 0 - %d*%d\n") at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:509
>> #4 0x00000000004e91db in sort_atoms (dim=0, Backwards=0,
>> a=0x7ffff212e920, n=2, x=0x7ffff148a650, h0=10.2296,
>> invh=58.3941765, n_per_h=16, sort=0x7ffff10bb2b0) at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:609
>> #5 0x00000000004ebcee in sort_columns_supersub
>> (nbs=0x7ffff057f2a0, dd_zone=1, grid=0x7ffff06d6f60, a0=61048,
>> a1=74523, atinfo=0x7ffff0c51840, x=0x7ffff148a650,
>> nbat=0x7ffff05d3bb0, cxy_start=10, cxy_end=15,
>> sort_work=0x7ffff10bb2b0) at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1394
>> #6 0x00000000004ecf09 in calc_cell_indices.omp_fn.1
>> (.omp_data_i=0x7ffff516f670) at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1651
>> #7 0x00007ffff62a50ba in ?? () from /usr/lib64/libgomp.so.1
>> #8 0x00007ffff796de0e in start_thread () from /lib64/libpthread.so.0
>> #9 0x00007ffff5dc42cd in clone () from /lib64/libc.so.6
>>
>>> On how many MPI ranks is this?
>>> If I can easily run this, could you mail me the tpr and the run
>>> settings?
>> mdrun_threads -ntmpi 2 -s in.tpr -v -gpu_id 00
>>
>> Best,
>> Carsten
>>
>>>
>>> Cheers,
>>>
>>> Berk
>>>
>>> On 11/08/2013 02:30 PM, Carsten Kutzner wrote:
>>>> Hi,
>>>>
>>>> using a just checked-out 4.6 branch compiled with debug checks
>>>> I get
>>>>
>>>> -------------------------------------------------------
>>>> Program mdrun, VERSION 4.6.4-dev-20131107-ba8232e
>>>> Source code file:
>>>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c,
>>>> line: 609
>>>>
>>>> Fatal error:
>>>> (int)((x[74522][x]=11.764535 - 10.229600)*58.394176) = 89, not
>>>> in 0 - 16*4
>>>>
>>>> For more information and tips for troubleshooting, please check
>>>> the GROMACS
>>>> website at http://www.gromacs.org/Documentation/Errors
>>>> -------------------------------------------------------
>>>>
>>>> Carsten
>>>>
>>>>
>>>> On 11/08/2013 02:00 PM, Berk Hess wrote:
>>>>> On 11/08/2013 01:44 PM, Mark Abraham wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 8, 2013 at 12:58 PM, Carsten Kutzner
>>>>>> <ckutzne at gwdg.de <mailto:ckutzne at gwdg.de>> wrote:
>>>>>>
>>>>>> Hi Mark, hi Berk,
>>>>>>
>>>>>> On Nov 7, 2013, at 6:48 PM, Berk Hess <hess at kth.se
>>>>>> <mailto:hess at kth.se>> wrote:
>>>>>>
>>>>>> > Hi Carsten,
>>>>>> >
>>>>>> > After how many steps does this happen?
>>>>>> This happens immediately at startup.
>>>>>>
>>>>>> > Could you run with a debug build (or without NDEBUG
>>>>>> defined)?
>>>>>> > I added a lot of checks, not done with NDEBUG, in the
>>>>>> fix for the issue you linked.
>>>>>> Will do that now.
>>>>>>
>>>>>> > On 11/07/2013 06:27 PM, Mark Abraham wrote:
>>>>>> >> Unclear. 6583c94 is one of your commits. Some very
>>>>>> recent stuff has been playing with nstlist and rlist
>>>>>> (safely, or so we thought.) Can you reproduce with
>>>>>> mainstream release-4-6?
>>>>>> This is basically mainstream 4-6, since in my commit I
>>>>>> only changed the default behavior of
>>>>>> appending to no.
>>>>>>
>>>>>>
>>>>>> Right. What's the mainstream parent commit? I was going to
>>>>>> release 4.6.4 today - if you're based off the current tip
>>>>>> then maybe we shouldn't. If you're based off code a month
>>>>>> back then we know the problem, if any, is of longer standing.
>>>>> This is 4.6.4-dev which seems to include my fix for the
>>>>> previous issue, so this issue is surely present in the current
>>>>> 4-6-release branch. It must be due to a somewhat exotic
>>>>> condition, since this code is widely used and we haven't had
>>>>> other reports.
>>>>>
>>>>> I think it should be easy to track this down with all the
>>>>> debug checks in the code.
>>>>> And if Carsten can send me his system and the conditions to
>>>>> reproduce it, I can also help with debugging.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Berk
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>> Carsten
>>>>>>
>>>>>> >>
>>>>>> >> Mark
>>>>>> >>
>>>>>> >>
>>>>>> >> On Thu, Nov 7, 2013 at 5:18 PM, Carsten Kutzner
>>>>>> <ckutzne at gwdg.de <mailto:ckutzne at gwdg.de>> wrote:
>>>>>> >> Hi,
>>>>>> >>
>>>>>> >> we have a 120k atom system that crashes with
>>>>>> >>
>>>>>> >> ------------------------------------------------------
>>>>>> >> Program mdrun_mpi, VERSION 4.6.4-dev-20131015-6583c94
>>>>>> >> Source code file:
>>>>>> /home/c/gromacs/src/mdlib/nbnxn_search.c, line: 685
>>>>>> >>
>>>>>> >> Software inconsistency error:
>>>>>> >> Lost particles while sorting
>>>>>> >> For more information and tips for troubleshooting,
>>>>>> please check the GROMACS
>>>>>> >> website at http://www.gromacs.org/Documentation/Errors
>>>>>> >> -------------------------------------------------------
>>>>>> >>
>>>>>> >> if run with >= 2 MPI processes on a GPU and small
>>>>>> values for nstlist. On my workstation,
>>>>>> >> nstlist = 34 and larger works, whereas nstlist <= 33
>>>>>> leads to the above problem.
>>>>>> >>
>>>>>> >> Another system (60k atoms) does not produce this
>>>>>> problem, so system size seems
>>>>>> >> to matter as well.
>>>>>> >>
>>>>>> >> Looks like an old ghost:
>>>>>> >>
>>>>>> >> http://redmine.gromacs.org/issues/1153
>>>>>> >>
>>>>>> >>
>>>>>> >> Should I file a redmine issue?
>>>>>> >>
>>>>>> >> Carsten
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> gmx-developers mailing list
>>>>>> >> gmx-developers at gromacs.org
>>>>>> <mailto:gmx-developers at gromacs.org>
>>>>>> >> http://lists.gromacs.org/mailman/listinfo/gmx-developers
>>>>>> >> Please don't post (un)subscribe requests to the list.
>>>>>> Use the www interface or send it to
>>>>>> gmx-developers-request at gromacs.org
>>>>>> <mailto:gmx-developers-request at gromacs.org>.
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >
>>>>>>
>
>
>
>
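To illustrate the diagnosis quoted above: the fatal error reports a sorting-bin index computed as (int)((x - h0)*invh), and for atom 74522 this gives 89 where the grid only extends to 16*4 = 64. Below is a minimal C sketch of that arithmetic and of the "round down the index to the maximum" idea. The names bin_index and clamp_bin_index are hypothetical and are not taken from nbnxn_search.c, and treating the upper bound as inclusive is an assumption based only on the wording "not in 0 - 16*4".

```c
/* Sketch only: bin_index and clamp_bin_index are hypothetical names,
 * not functions from nbnxn_search.c. */

/* Bin of coordinate x on a sorting grid that starts at h0 and has
 * inverse bin width invh. The cast truncates toward zero, matching
 * the expression printed in the error message. */
int bin_index(double x, double h0, double invh)
{
    return (int)((x - h0) * invh);
}

/* Round an index down to max_index when it lies beyond the grid.
 * Assumption: the valid range is 0..max_index inclusive.
 * Negative indices are returned unchanged so that a caller can
 * still detect and report them. */
int clamp_bin_index(int idx, int max_index)
{
    return idx > max_index ? max_index : idx;
}
```

With the values from the backtrace, bin_index(11.764535, 10.229600, 58.394176) yields 89, and clamp_bin_index(89, 16*4) yields 64. As noted in the quoted message, clamping unconditionally would also hide out-of-range atoms in cases where they should never occur, such as runs without DD, so a real fix would presumably restrict the clamp to the cases where it is legitimate.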