[gmx-developers] Lost particles while sorting

Berk Hess hess at kth.se
Fri Nov 8 15:39:31 CET 2013


I think I can upload a fix around 18:00.

Cheers,

Berk

On 11/08/2013 03:37 PM, Mark Abraham wrote:
> OK, thanks Berk. Szilard and I each have thought of some minor 
> documentation things that would be nice to fix also.
>
> I suggest we try to get all that in so we can shift focus away from 
> release-4-6, with the 5.0 beta only weeks away now!!
>
> Mark
>
>
> On Fri, Nov 8, 2013 at 3:30 PM, Berk Hess <hess at kth.se 
> <mailto:hess at kth.se>> wrote:
>
>     Hi,
>
>     I think I found the source of the problem, now I have to come up
>     with a solution.
>     It seems you have a harmonic potential between two atoms at a
>     distance of 2.9 nm. This is much longer than the pair-list range.
>     So one (non-local) atom ends up beyond the non-local search grid.
>     I thought I had accounted for such cases, but apparently not.
>     We can probably simply round down the index to the maximum, but
>     that means we can't check for other cases, such as without DD,
>     where atoms can't be beyond the grid.
>     The fix will be very simple and quick, but I need to think it over
>     a bit more deeply.
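
To make the "round down the index to the maximum" idea concrete, here is a
minimal sketch in C. It is not the actual nbnxn_search.c code: the function
names bin_index and clamp_bin_index are hypothetical, and treating the
printed upper bound 16*4 = 64 as the largest valid index is an assumption
read off the error message. The numbers are the ones from the fatal error
quoted further down in this thread.

```c
#include <assert.h>

/* Hypothetical helper, not the GROMACS source: map a coordinate to a
 * sorting-bin index along one dimension, mirroring the expression
 * (int)((x - h0)*invh) printed in the fatal error message. */
int bin_index(double x, double h0, double invh)
{
    return (int)((x - h0) * invh);
}

/* Hypothetical helper: force an index into [0, max_index].
 * An atom beyond the search grid lands in the last bin instead of
 * triggering the range check. Assumes max_index is a valid index. */
int clamp_bin_index(int index, int max_index)
{
    assert(max_index >= 0);
    if (index < 0)
    {
        return 0;
    }
    if (index > max_index)
    {
        return max_index;
    }
    return index;
}
```

With x = 11.764535, h0 = 10.229600 and invh = 58.394176, bin_index gives
89, and clamp_bin_index(89, 16*4) gives 64. The drawback is the one
mentioned above: once the index is clamped unconditionally, the range
check can no longer catch atoms that should never be outside the grid.
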
>
>     Cheers,
>
>     Berk
>
>
>     On 11/08/2013 02:58 PM, Carsten Kutzner wrote:
>>     Hi Berk,
>>
>>     On 11/08/2013 02:48 PM, Berk Hess wrote:
>>>     Hi,
>>>
>>>     I assume this is with GPUs.
>>>     If you run in a debugger, break on exit, can you tell me which
>>>     sort_atoms call this comes from?
>>>
>>     It comes from sort_columns_supersub:
>>
>>
>>     Breakpoint 1, 0x00007ffff5d17920 in exit () from /lib64/libc.so.6
>>     (gdb) where
>>     #0  0x00007ffff5d17920 in exit () from /lib64/libc.so.6
>>     #1  0x0000000000a60558 in quit_gmx (msg=0x7fffee092290 "\n", '-'
>>     <repeats 55 times>, "\nProgram mdrun, VERSION
>>     4.6.4-dev-20131107-ba8232e\nSource code file:
>>     /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c,
>>     l"...) at
>>     /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:284
>>     #2  0x0000000000a61609 in _gmx_error (key=0x185ecde "fatal",
>>     msg=0x7fffee094af0 "(int)((x[74522][x]=11.764535 -
>>     10.229600)*58.394176) = 89, not in 0 - 16*4\n", file=0x1836ad8
>>     "/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c",
>>     line=609) at
>>     /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:774
>>     #3  0x0000000000a61054 in gmx_fatal (f_errno=0, file=0x1836ad8
>>     "/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c",
>>     line=609, fmt=0x1836c80 "(int)((x[%d][%c]=%f - %f)*%f) = %d, not
>>     in 0 - %d*%d\n") at
>>     /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:509
>>     #4  0x00000000004e91db in sort_atoms (dim=0, Backwards=0,
>>     a=0x7ffff212e920, n=2, x=0x7ffff148a650, h0=10.2296,
>>     invh=58.3941765, n_per_h=16, sort=0x7ffff10bb2b0) at
>>     /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:609
>>     #5  0x00000000004ebcee in sort_columns_supersub
>>     (nbs=0x7ffff057f2a0, dd_zone=1, grid=0x7ffff06d6f60, a0=61048,
>>     a1=74523, atinfo=0x7ffff0c51840, x=0x7ffff148a650,
>>     nbat=0x7ffff05d3bb0, cxy_start=10, cxy_end=15,
>>     sort_work=0x7ffff10bb2b0) at
>>     /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1394
>>     #6  0x00000000004ecf09 in calc_cell_indices.omp_fn.1
>>     (.omp_data_i=0x7ffff516f670) at
>>     /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1651
>>     #7  0x00007ffff62a50ba in ?? () from /usr/lib64/libgomp.so.1
>>     #8  0x00007ffff796de0e in start_thread () from /lib64/libpthread.so.0
>>     #9  0x00007ffff5dc42cd in clone () from /lib64/libc.so.6
>>
>>>     On how many MPI ranks is this?
>>>     If I can easily run this, could you mail me the tpr and the run
>>>     settings?
>>     mdrun_threads -ntmpi 2 -s in.tpr -v -gpu_id 00
>>
>>     Best,
>>     Carsten
>>
>>>
>>>     Cheers,
>>>
>>>     Berk
>>>
>>>     On 11/08/2013 02:30 PM, Carsten Kutzner wrote:
>>>>     Hi,
>>>>
>>>>     using a just checked-out 4.6 branch compiled with debug checks
>>>>     I get
>>>>
>>>>     -------------------------------------------------------
>>>>     Program mdrun, VERSION 4.6.4-dev-20131107-ba8232e
>>>>     Source code file:
>>>>     /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c,
>>>>     line: 609
>>>>
>>>>     Fatal error:
>>>>     (int)((x[74522][x]=11.764535 - 10.229600)*58.394176) = 89, not
>>>>     in 0 - 16*4
>>>>
>>>>     For more information and tips for troubleshooting, please check
>>>>     the GROMACS
>>>>     website at http://www.gromacs.org/Documentation/Errors
>>>>     -------------------------------------------------------
>>>>
>>>>     Carsten
>>>>
>>>>
>>>>     On 11/08/2013 02:00 PM, Berk Hess wrote:
>>>>>     On 11/08/2013 01:44 PM, Mark Abraham wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>     On Fri, Nov 8, 2013 at 12:58 PM, Carsten Kutzner
>>>>>>     <ckutzne at gwdg.de <mailto:ckutzne at gwdg.de>> wrote:
>>>>>>
>>>>>>         Hi Mark, hi Berk,
>>>>>>
>>>>>>         On Nov 7, 2013, at 6:48 PM, Berk Hess <hess at kth.se
>>>>>>         <mailto:hess at kth.se>> wrote:
>>>>>>
>>>>>>         > Hi Carsten,
>>>>>>         >
>>>>>>         > After how many steps does this happen?
>>>>>>         this happens immediately at startup.
>>>>>>
>>>>>>         > Could you run with a debug build (or without NDEBUG
>>>>>>         defined)?
>>>>>>         > I added a lot of checks, not done with NDEBUG, in the
>>>>>>         fix for the issue you linked.
>>>>>>         Will do that now.
>>>>>>
>>>>>>         > On 11/07/2013 06:27 PM, Mark Abraham wrote:
>>>>>>         >> Unclear. 6583c94 is one of your commits. Some very
>>>>>>         recent stuff has been playing with nstlist and rlist
>>>>>>         (safely, or so we thought.) Can you reproduce with
>>>>>>         mainstream release-4-6?
>>>>>>         This is basically mainstream 4-6, since in my commit I
>>>>>>         only changed the default behavior of
>>>>>>         appending to no.
>>>>>>
>>>>>>
>>>>>>     Right. What's the mainstream parent commit? I was going to
>>>>>>     release 4.6.4 today - if you're based off the current tip
>>>>>>     then maybe we shouldn't. If you're based off code a month
>>>>>>     back then we know the problem, if any, is of longer standing.
>>>>>     This is 4.6.4-dev which seems to include my fix for the
>>>>>     previous issue, so this issue is surely present in the current
>>>>>     4-6-release branch. It must be due to a somewhat exotic
>>>>>     condition, since this code is widely used and we haven't had
>>>>>     other reports.
>>>>>
>>>>>     I think it should be easy to track this down with all the
>>>>>     debug checks in the code.
>>>>>     And if Carsten can send me his system and the conditions to
>>>>>     reproduce it, I can also help with debugging.
>>>>>
>>>>>     Cheers,
>>>>>
>>>>>     Berk
>>>>>>
>>>>>>     Mark
>>>>>>
>>>>>>
>>>>>>         Carsten
>>>>>>
>>>>>>         >>
>>>>>>         >> Mark
>>>>>>         >>
>>>>>>         >>
>>>>>>         >> On Thu, Nov 7, 2013 at 5:18 PM, Carsten Kutzner
>>>>>>         <ckutzne at gwdg.de <mailto:ckutzne at gwdg.de>> wrote:
>>>>>>         >> Hi,
>>>>>>         >>
>>>>>>         >> we have a 120k atom system that crashes with
>>>>>>         >>
>>>>>>         >> ------------------------------------------------------
>>>>>>         >> Program mdrun_mpi, VERSION 4.6.4-dev-20131015-6583c94
>>>>>>         >> Source code file:
>>>>>>         /home/c/gromacs/src/mdlib/nbnxn_search.c, line: 685
>>>>>>         >>
>>>>>>         >> Software inconsistency error:
>>>>>>         >> Lost particles while sorting
>>>>>>         >> For more information and tips for troubleshooting,
>>>>>>         please check the GROMACS
>>>>>>         >> website at http://www.gromacs.org/Documentation/Errors
>>>>>>         >> -------------------------------------------------------
>>>>>>         >>
>>>>>>         >> if run with >= 2 MPI processes on a GPU and small
>>>>>>         values for nstlist. On my workstation,
>>>>>>         >> nstlist = 34 and larger works, whereas nstlist <= 33
>>>>>>         leads to the above problem.
>>>>>>         >>
>>>>>>         >> Another system (60k atoms) does not produce this
>>>>>>         problem, so system size seems
>>>>>>         >> to matter as well.
>>>>>>         >>
>>>>>>         >> Looks like an old ghost:
>>>>>>         >>
>>>>>>         >> http://redmine.gromacs.org/issues/1153
>>>>>>         >>
>>>>>>         >>
>>>>>>         >> Should I file a redmine issue?
>>>>>>         >>
>>>>>>         >> Carsten
>>>>>>         >>
>>>>>>         >>
>>>>>>         >> --
>>>>>>         >> gmx-developers mailing list
>>>>>>         >> gmx-developers at gromacs.org
>>>>>>         <mailto:gmx-developers at gromacs.org>
>>>>>>         >> http://lists.gromacs.org/mailman/listinfo/gmx-developers
>>>>>>         >> Please don't post (un)subscribe requests to the list.
>>>>>>         Use the www interface or send it to
>>>>>>         gmx-developers-request at gromacs.org
>>>>>>         <mailto:gmx-developers-request at gromacs.org>.
>>>>>>         >>
>>>>>>         >>
>>>>>>         >>
>>>>>>         >
>>>>>>
