[gmx-developers] Lost particles while sorting

Mark Abraham mark.j.abraham at gmail.com
Fri Nov 8 15:37:03 CET 2013


OK, thanks Berk. Szilard and I each have thought of some minor
documentation things that would be nice to fix also.

I suggest we try to get all that in so we can shift focus away from
release-4-6, with the 5.0 beta only weeks away now!!

Mark


On Fri, Nov 8, 2013 at 3:30 PM, Berk Hess <hess at kth.se> wrote:

>  Hi,
>
> I think I found the source of the problem, now I have to come up with a
> solution.
> It seems you have a harmonic potential between two atoms at a distance of
> 2.9 nm. This is much longer than the pair-list range. So one (non-local)
> atom ends up beyond the non-local search grid. I thought I had accounted
> for such cases, but apparently not.
> We can probably simply round the index down to the maximum, but that means
> we can't check for other cases, such as without DD, where atoms can't be
> beyond the grid.
> The fix will be very simple and quick, but I need to think it over a bit
> more.
>
> Cheers,
>
> Berk
>
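For readers following the diagnosis above, here is a minimal sketch of the "round the index down to the maximum" idea, using the values from the sort_atoms frame in the backtrace. The helper names bin_index and clamp_bin_index are hypothetical and are not the code in nbnxn_search.c. Treating the upper bound n_per_h*4 = 64 as inclusive is also an assumption.

```c
#include <assert.h>

/* Hypothetical helper, not the GROMACS routine: map a coordinate to a
 * sorting-bin index along one dimension, as in the error message
 * (int)((x - h0)*invh). */
static int bin_index(double x, double h0, double invh)
{
    return (int)((x - h0) * invh);
}

/* One option mentioned in the thread: round an index that lies beyond
 * the grid down to the maximum instead of aborting. Only the upper side
 * is clamped here; the inclusive bound is an assumption. */
static int clamp_bin_index(int idx, int max_idx)
{
    assert(max_idx >= 0);
    return (idx > max_idx) ? max_idx : idx;
}
```

With x = 11.764535, h0 = 10.2296 and invh = 58.3941765, bin_index gives 89, which exceeds 16*4 = 64, so the clamp would return 64. The trade-off noted above still applies: clamping unconditionally also hides the cases (such as runs without DD) where an out-of-range index would indicate a real bug.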
>
> On 11/08/2013 02:58 PM, Carsten Kutzner wrote:
>
> Hi Berk,
>
> On 11/08/2013 02:48 PM, Berk Hess wrote:
>
> Hi,
>
> I assume this is with GPUs.
> If you run in a debugger, break on exit, can you tell me which sort_atoms
> call this comes from?
>
>  It comes from sort_columns_supersub:
>
>
> Breakpoint 1, 0x00007ffff5d17920 in exit () from /lib64/libc.so.6
> (gdb) where
> #0  0x00007ffff5d17920 in exit () from /lib64/libc.so.6
> #1  0x0000000000a60558 in quit_gmx (msg=0x7fffee092290 "\n", '-' <repeats
> 55 times>, "\nProgram mdrun, VERSION 4.6.4-dev-20131107-ba8232e\nSource
> code file:
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c,
> l"...) at
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:284
> #2  0x0000000000a61609 in _gmx_error (key=0x185ecde "fatal",
> msg=0x7fffee094af0 "(int)((x[74522][x]=11.764535 - 10.229600)*58.394176) =
> 89, not in 0 - 16*4\n", file=0x1836ad8
> "/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c",
> line=609) at
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:774
> #3  0x0000000000a61054 in gmx_fatal (f_errno=0, file=0x1836ad8
> "/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c",
> line=609, fmt=0x1836c80 "(int)((x[%d][%c]=%f - %f)*%f) = %d, not in 0 -
> %d*%d\n") at
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:509
> #4  0x00000000004e91db in sort_atoms (dim=0, Backwards=0,
> a=0x7ffff212e920, n=2, x=0x7ffff148a650, h0=10.2296, invh=58.3941765,
> n_per_h=16, sort=0x7ffff10bb2b0) at
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:609
> #5  0x00000000004ebcee in sort_columns_supersub (nbs=0x7ffff057f2a0,
> dd_zone=1, grid=0x7ffff06d6f60, a0=61048, a1=74523, atinfo=0x7ffff0c51840,
> x=0x7ffff148a650, nbat=0x7ffff05d3bb0, cxy_start=10, cxy_end=15,
> sort_work=0x7ffff10bb2b0) at
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1394
> #6  0x00000000004ecf09 in calc_cell_indices.omp_fn.1
> (.omp_data_i=0x7ffff516f670) at
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1651
> #7  0x00007ffff62a50ba in ?? () from /usr/lib64/libgomp.so.1
> #8  0x00007ffff796de0e in start_thread () from /lib64/libpthread.so.0
> #9  0x00007ffff5dc42cd in clone () from /lib64/libc.so.6
>
>  On how many MPI ranks is this?
> If I can easily run this, could you mail me the tpr and the run settings?
>
> mdrun_threads -ntmpi 2 -s in.tpr -v -gpu_id 00
>
> Best,
> Carsten
>
>
> Cheers,
>
> Berk
>
> On 11/08/2013 02:30 PM, Carsten Kutzner wrote:
>
> Hi,
>
> using a just checked-out 4.6 branch compiled with debug checks I get
>
> -------------------------------------------------------
> Program mdrun, VERSION 4.6.4-dev-20131107-ba8232e
> Source code file:
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c,
> line: 609
>
> Fatal error:
> (int)((x[74522][x]=11.764535 - 10.229600)*58.394176) = 89, not in 0 - 16*4
>
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> Carsten
>
>
> On 11/08/2013 02:00 PM, Berk Hess wrote:
>
> On 11/08/2013 01:44 PM, Mark Abraham wrote:
>
>
>
>
> On Fri, Nov 8, 2013 at 12:58 PM, Carsten Kutzner <ckutzne at gwdg.de> wrote:
>
>> Hi Mark, hi Berk,
>>
>> On Nov 7, 2013, at 6:48 PM, Berk Hess <hess at kth.se> wrote:
>>
>> > Hi Carsten,
>> >
>> > After how many steps does this happen?
>>  This happens immediately at startup.
>>
>> > Could you run with a debug build (or without NDEBUG defined)?
>> > I added a lot of checks, not done with NDEBUG, in the fix for the issue
>> you linked.
>>  Will do that now.
>>
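As background for the NDEBUG remark above: in this thread the release build only reports "Lost particles while sorting" (line 685), while the build with debug checks reports the out-of-range index earlier (line 609). A rough sketch of how such a check can be compiled out follows. The function debug_check_bin is a hypothetical illustration, not the actual GROMACS check.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical debug-only range check. Returns 1 if the index is
 * accepted and 0 if a debug build finds it out of range. When NDEBUG is
 * defined the test is compiled out and every index is accepted, so a
 * bad index is only noticed later, if at all. */
static int debug_check_bin(int idx, int max_idx)
{
#ifndef NDEBUG
    if (idx < 0 || idx > max_idx)
    {
        fprintf(stderr, "bin index %d not in 0 - %d\n", idx, max_idx);
        return 0;
    }
#else
    (void)idx;
    (void)max_idx;
#endif
    return 1;
}
```

The standard assert macro follows the same convention: it expands to nothing when NDEBUG is defined.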
>> > On 11/07/2013 06:27 PM, Mark Abraham wrote:
>> >> Unclear. 6583c94 is one of your commits. Some very recent stuff has
>> been playing with nstlist and rlist (safely, or so we thought.) Can you
>> reproduce with mainstream release-4-6?
>>  This is basically mainstream 4-6, since in my commit I only changed the
>> default behavior of
>> appending to no.
>>
>
>  Right. What's the mainstream parent commit? I was going to release 4.6.4
> today - if you're based off the current tip then maybe we shouldn't. If
> you're based off code a month back then we know the problem, if any, is of
> longer standing.
>
> This is 4.6.4-dev which seems to include my fix for the previous issue, so
> this issue is surely present in the current 4-6-release branch. It must be
> due to a somewhat exotic condition, since this code is widely used and we
> haven't had other reports.
>
> I think it should be easy to track this down with all the debug checks in
> the code.
> And if Carsten can send me his system and the conditions to reproduce it,
> I can also help with debugging.
>
> Cheers,
>
> Berk
>
>
>  Mark
>
>
>> Carsten
>>
>> >>
>> >> Mark
>> >>
>> >>
>> >> On Thu, Nov 7, 2013 at 5:18 PM, Carsten Kutzner <ckutzne at gwdg.de>
>> wrote:
>> >> Hi,
>> >>
>> >> we have a 120k atom system that crashes with
>> >>
>> >> ------------------------------------------------------
>> >> Program mdrun_mpi, VERSION 4.6.4-dev-20131015-6583c94
>> >> Source code file: /home/c/gromacs/src/mdlib/nbnxn_search.c, line: 685
>> >>
>> >> Software inconsistency error:
>> >> Lost particles while sorting
>> >> For more information and tips for troubleshooting, please check the
>> GROMACS
>> >> website at http://www.gromacs.org/Documentation/Errors
>> >> -------------------------------------------------------
>> >>
>> >> if run with >= 2 MPI processes on a GPU and small values for nstlist.
>> On my workstation,
>> >> nstlist = 34 and larger works, whereas nstlist <= 33 leads to the above
>> problem.
>> >>
>> >> Another system (60k atoms) does not produce this problem, so system
>> size seems
>> >> to matter as well.
>> >>
>> >> Looks like an old ghost:
>> >>
>> >> http://redmine.gromacs.org/issues/1153
>> >>
>> >>
>> >> Should I file a redmine issue?
>> >>
>> >> Carsten
>> >>
>> >>
>> >> --
>> >> gmx-developers mailing list
>> >> gmx-developers at gromacs.org
>> >> http://lists.gromacs.org/mailman/listinfo/gmx-developers
>> >> Please don't post (un)subscribe requests to the list. Use the www
>> interface or send it to gmx-developers-request at gromacs.org.
>> >>
>> >>
>> >>