[gmx-developers] Lost particles while sorting

Mark Abraham mark.j.abraham at gmail.com
Fri Nov 8 17:46:24 CET 2013


OK, no rush. Monday is more than fine for 4.6.4! :-)

Mark


On Fri, Nov 8, 2013 at 3:39 PM, Berk Hess <hess at kth.se> wrote:

>  I think I can upload a fix around 18:00.
>
> Cheers,
>
> Berk
>
>
> On 11/08/2013 03:37 PM, Mark Abraham wrote:
>
> OK, thanks Berk. Szilard and I have each thought of some minor
> documentation things that would also be nice to fix.
>
>  I suggest we try to get all that in so we can shift focus away from
> release-4-6, with the 5.0 beta only weeks away now!!
>
>  Mark
>
>
> On Fri, Nov 8, 2013 at 3:30 PM, Berk Hess <hess at kth.se> wrote:
>
>>  Hi,
>>
>> I think I found the source of the problem, now I have to come up with a
>> solution.
>> It seems you have a harmonic potential between two atoms at a distance of
>> 2.9 nm. This is much longer than the pair-list range. So one (non-local)
>> atom ends up beyond the non-local search grid. I thought I had accounted
>> for such cases, but apparently not.
>> We can probably simply round down the index to the maximum, but that means
>> we can't check for other cases, such as without DD, where atoms can't be
>> beyond the grid.
>> The fix will be very simple and quick, but I need to think it over a bit
>> more deeply.
>>
>> Cheers,
>>
>> Berk
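
A minimal sketch of the index computation and the "round down to the maximum" idea Berk describes above, written against the numbers in the error message rather than the actual GROMACS source. The function names and the GRID_OVERSIZE macro are hypothetical, and both the factor of 4 and the inclusive upper bound are assumptions read off the "not in 0 - 16*4" text.

```c
#include <assert.h>

/* Hypothetical stand-in for the oversize factor. The "16*4" in the error
 * message suggests a factor of 4, but that reading is an assumption. */
#define GRID_OVERSIZE 4

/* Bin index along one dimension, in the form printed by the error message:
 * (int)((x - h0)*invh). The cast truncates toward zero. */
int bin_index(double x, double h0, double invh)
{
    return (int)((x - h0) * invh);
}

/* Returns 1 if the index is in the accepted range, assumed here to be
 * 0 .. n_per_h*GRID_OVERSIZE inclusive, and 0 otherwise. */
int bin_index_in_range(int zi, int n_per_h)
{
    assert(n_per_h > 0);
    return zi >= 0 && zi <= n_per_h * GRID_OVERSIZE;
}

/* Clamp indices above the last bin to the last bin. Negative indices are
 * left untouched so that a separate check can still catch them. */
int bin_index_clamped(double x, double h0, double invh, int n_per_h)
{
    int zi   = bin_index(x, h0, invh);
    int zmax = n_per_h * GRID_OVERSIZE;

    assert(n_per_h > 0);
    return zi > zmax ? zmax : zi;
}
```

With the values from Carsten's backtrace (x = 11.764535, h0 = 10.2296, invh = 58.394176, n_per_h = 16) the unclamped index is 89, which lies outside 0 to 64, and the clamped variant returns 64. As Berk notes, clamping unconditionally would also hide genuine errors in runs without DD, where no atom should be beyond the grid.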
>>
>>
>> On 11/08/2013 02:58 PM, Carsten Kutzner wrote:
>>
>> Hi Berk,
>>
>> On 11/08/2013 02:48 PM, Berk Hess wrote:
>>
>> Hi,
>>
>> I assume this is with GPUs.
>> If you run in a debugger, break on exit, can you tell me which sort_atoms
>> call this comes from?
>>
>>  It comes from sort_columns_supersub:
>>
>>
>> Breakpoint 1, 0x00007ffff5d17920 in exit () from /lib64/libc.so.6
>> (gdb) where
>> #0  0x00007ffff5d17920 in exit () from /lib64/libc.so.6
>> #1  0x0000000000a60558 in quit_gmx (msg=0x7fffee092290 "\n", '-' <repeats
>> 55 times>, "\nProgram mdrun, VERSION 4.6.4-dev-20131107-ba8232e\nSource
>> code file:
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c,
>> l"...) at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:284
>> #2  0x0000000000a61609 in _gmx_error (key=0x185ecde "fatal",
>> msg=0x7fffee094af0 "(int)((x[74522][x]=11.764535 - 10.229600)*58.394176) =
>> 89, not in 0 - 16*4\n", file=0x1836ad8
>> "/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c",
>> line=609) at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:774
>> #3  0x0000000000a61054 in gmx_fatal (f_errno=0, file=0x1836ad8
>> "/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c",
>> line=609, fmt=0x1836c80 "(int)((x[%d][%c]=%f - %f)*%f) = %d, not in 0 -
>> %d*%d\n") at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:509
>> #4  0x00000000004e91db in sort_atoms (dim=0, Backwards=0,
>> a=0x7ffff212e920, n=2, x=0x7ffff148a650, h0=10.2296, invh=58.3941765,
>> n_per_h=16, sort=0x7ffff10bb2b0) at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:609
>> #5  0x00000000004ebcee in sort_columns_supersub (nbs=0x7ffff057f2a0,
>> dd_zone=1, grid=0x7ffff06d6f60, a0=61048, a1=74523, atinfo=0x7ffff0c51840,
>> x=0x7ffff148a650, nbat=0x7ffff05d3bb0, cxy_start=10, cxy_end=15,
>> sort_work=0x7ffff10bb2b0) at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1394
>> #6  0x00000000004ecf09 in calc_cell_indices.omp_fn.1
>> (.omp_data_i=0x7ffff516f670) at
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1651
>> #7  0x00007ffff62a50ba in ?? () from /usr/lib64/libgomp.so.1
>> #8  0x00007ffff796de0e in start_thread () from /lib64/libpthread.so.0
>> #9  0x00007ffff5dc42cd in clone () from /lib64/libc.so.6
>>
>>  On how many MPI ranks is this?
>> If I can easily run this, could you mail me the tpr and the run settings?
>>
>> mdrun_threads -ntmpi 2 -s in.tpr -v -gpu_id 00
>>
>> Best,
>> Carsten
>>
>>
>> Cheers,
>>
>> Berk
>>
>> On 11/08/2013 02:30 PM, Carsten Kutzner wrote:
>>
>> Hi,
>>
>> using a just checked-out 4.6 branch compiled with debug checks I get
>>
>> -------------------------------------------------------
>> Program mdrun, VERSION 4.6.4-dev-20131107-ba8232e
>> Source code file:
>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c,
>> line: 609
>>
>> Fatal error:
>> (int)((x[74522][x]=11.764535 - 10.229600)*58.394176) = 89, not in 0 - 16*4
>>
>> For more information and tips for troubleshooting, please check the
>> GROMACS
>> website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>>
>> Carsten
>>
>>
>> On 11/08/2013 02:00 PM, Berk Hess wrote:
>>
>> On 11/08/2013 01:44 PM, Mark Abraham wrote:
>>
>>
>>
>>
>> On Fri, Nov 8, 2013 at 12:58 PM, Carsten Kutzner <ckutzne at gwdg.de> wrote:
>>
>>> Hi Mark, hi Berk,
>>>
>>> On Nov 7, 2013, at 6:48 PM, Berk Hess <hess at kth.se> wrote:
>>>
>>> > Hi Carsten,
>>> >
>>> > After how many steps does this happen?
>>>  This happens immediately at startup.
>>>
>>> > Could you run with a debug build (or without NDEBUG defined)?
>>> > I added a lot of checks, not done with NDEBUG, in the fix for the
>>> issue you linked.
>>>  Will do that now.
>>>
>>> > On 11/07/2013 06:27 PM, Mark Abraham wrote:
>>> >> Unclear. 6583c94 is one of your commits. Some very recent stuff has
>>> been playing with nstlist and rlist (safely, or so we thought.) Can you
>>> reproduce with mainstream release-4-6?
>>>  This is basically mainstream 4-6, since in my commit I only changed the
>>> default behavior of
>>> appending to no.
>>>
>>
>>  Right. What's the mainstream parent commit? I was going to release
>> 4.6.4 today - if you're based off the current tip then maybe we shouldn't.
>> If you're based off code a month back then we know the problem, if any, is
>> of longer standing.
>>
>> This is 4.6.4-dev which seems to include my fix for the previous issue,
>> so this issue is surely present in the current 4-6-release branch. It must
>> be due to a somewhat exotic condition, since this code is widely used and
>> we haven't had other reports.
>>
>> I think it should be easy to track this down with all the debug checks in
>> the code.
>> And if Carsten can send me his system and the conditions to reproduce it,
>> I can also help with debugging.
>>
>> Cheers,
>>
>> Berk
>>
>>
>>  Mark
>>
>>
>>> Carsten
>>>
>>> >>
>>> >> Mark
>>> >>
>>> >>
>>> >> On Thu, Nov 7, 2013 at 5:18 PM, Carsten Kutzner <ckutzne at gwdg.de>
>>> wrote:
>>> >> Hi,
>>> >>
>>> >> we have a 120k atom system that crashes with
>>> >>
>>> >> ------------------------------------------------------
>>> >> Program mdrun_mpi, VERSION 4.6.4-dev-20131015-6583c94
>>> >> Source code file: /home/c/gromacs/src/mdlib/nbnxn_search.c, line: 685
>>> >>
>>> >> Software inconsistency error:
>>> >> Lost particles while sorting
>>> >> For more information and tips for troubleshooting, please check the
>>> GROMACS
>>> >> website at http://www.gromacs.org/Documentation/Errors
>>> >> -------------------------------------------------------
>>> >>
>>> >> if run with >= 2 MPI processes on a GPU and small values for nstlist.
>>> On my workstation,
>>> >> nstlist = 34 and larger works, whereas nstlist <= 33 leads to the
>>> above problem.
>>> >>
>>> >> Another system (60k atoms) does not produce this problem, so system
>>> size seems
>>> >> to matter as well.
>>> >>
>>> >> Looks like an old ghost:
>>> >>
>>> >> http://redmine.gromacs.org/issues/1153
>>> >>
>>> >>
>>> >> Should I file a redmine issue?
>>> >>
>>> >> Carsten
>>> >>
>>> >>
>>> >> --
>>> >> gmx-developers mailing list
>>> >> gmx-developers at gromacs.org
>>> >> http://lists.gromacs.org/mailman/listinfo/gmx-developers
>>> >> Please don't post (un)subscribe requests to the list. Use the www
>>> interface or send it to gmx-developers-request at gromacs.org.
>>> >>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
>