[gmx-developers] Lost particles while sorting

Berk Hess hess at kth.se
Fri Nov 8 15:30:00 CET 2013


Hi,

I think I found the source of the problem, now I have to come up with a 
solution.
It seems you have a harmonic potential between two atoms at a distance 
of 2.9 nm. This is much longer than the pair-list range. So one 
(non-local) atom ends up beyond the non-local search grid. I thought I 
had accounted for such cases, but apparently not.
We can probably simply round the index down to the maximum, but that 
means we can no longer check for other cases, such as runs without DD, 
where atoms can't be beyond the grid.
The fix will be very simple and quick, but I need to think it over a 
bit more.
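
To make the arithmetic concrete, here is a minimal sketch of the failing 
check and of the "round down to the maximum" idea, using the numbers from 
the backtrace quoted below. The function names are hypothetical and this 
is not the actual nbnxn_search.c code; the allowed range 0 to 
n_per_h*oversize is an assumption read off the error message 
("not in 0 - 16*4").

```c
#include <assert.h>

/* Hypothetical helper (not GROMACS code): map a coordinate to a sort-bin
 * index, where h0 is the lower bound of the cell and invh the inverse
 * bin width, mirroring (int)((x - h0)*invh) in the error message. */
static int sort_bin_index(double x, double h0, double invh)
{
    return (int)((x - h0) * invh);
}

/* Returns 1 when the bin index lies in 0 .. n_per_h*oversize, which is
 * the range the debug check appears to enforce (assumption). */
static int sort_bin_in_range(int bin, int n_per_h, int oversize)
{
    assert(n_per_h > 0 && oversize > 0);
    return bin >= 0 && bin <= n_per_h * oversize;
}

/* Possible workaround: round an index beyond the grid down to the
 * maximum. Negative indices are deliberately left untouched here. */
static int sort_bin_clamped(int bin, int n_per_h, int oversize)
{
    int max_bin = n_per_h * oversize;

    return bin > max_bin ? max_bin : bin;
}
```

With x = 11.764535, h0 = 10.229600 and invh = 58.394176, 
sort_bin_index gives 89, which is outside 0 to 64 for n_per_h = 16 and 
an oversize factor of 4; sort_bin_clamped would map it to 64. The 
drawback is the one mentioned above: once the index is clamped, the 
range check can no longer catch atoms that really should not be beyond 
the grid.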

Cheers,

Berk

On 11/08/2013 02:58 PM, Carsten Kutzner wrote:
> Hi Berk,
>
> On 11/08/2013 02:48 PM, Berk Hess wrote:
>> Hi,
>>
>> I assume this is with GPUs.
>> If you run in a debugger, break on exit, can you tell me which 
>> sort_atoms call this comes from?
>>
> It comes from sort_columns_supersub:
>
>
> Breakpoint 1, 0x00007ffff5d17920 in exit () from /lib64/libc.so.6
> (gdb) where
> #0  0x00007ffff5d17920 in exit () from /lib64/libc.so.6
> #1  0x0000000000a60558 in quit_gmx (msg=0x7fffee092290 "\n", '-' 
> <repeats 55 times>, "\nProgram mdrun, VERSION 
> 4.6.4-dev-20131107-ba8232e\nSource code file: 
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c, 
> l"...) at 
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:284
> #2  0x0000000000a61609 in _gmx_error (key=0x185ecde "fatal", 
> msg=0x7fffee094af0 "(int)((x[74522][x]=11.764535 - 
> 10.229600)*58.394176) = 89, not in 0 - 16*4\n", file=0x1836ad8 
> "/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c", 
> line=609) at 
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:774
> #3  0x0000000000a61054 in gmx_fatal (f_errno=0, file=0x1836ad8 
> "/home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c", 
> line=609, fmt=0x1836c80 "(int)((x[%d][%c]=%f - %f)*%f) = %d, not in 0 
> - %d*%d\n") at 
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/gmxlib/gmx_fatal.c:509
> #4  0x00000000004e91db in sort_atoms (dim=0, Backwards=0, 
> a=0x7ffff212e920, n=2, x=0x7ffff148a650, h0=10.2296, invh=58.3941765, 
> n_per_h=16, sort=0x7ffff10bb2b0) at 
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:609
> #5  0x00000000004ebcee in sort_columns_supersub (nbs=0x7ffff057f2a0, 
> dd_zone=1, grid=0x7ffff06d6f60, a0=61048, a1=74523, 
> atinfo=0x7ffff0c51840, x=0x7ffff148a650, nbat=0x7ffff05d3bb0, 
> cxy_start=10, cxy_end=15, sort_work=0x7ffff10bb2b0) at 
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1394
> #6  0x00000000004ecf09 in calc_cell_indices.omp_fn.1 
> (.omp_data_i=0x7ffff516f670) at 
> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c:1651
> #7  0x00007ffff62a50ba in ?? () from /usr/lib64/libgomp.so.1
> #8  0x00007ffff796de0e in start_thread () from /lib64/libpthread.so.0
> #9  0x00007ffff5dc42cd in clone () from /lib64/libc.so.6
>
>> On how many MPI ranks is this?
>> If I can easily run this, could you mail me the tpr and the run settings?
> mdrun_threads -ntmpi 2 -s in.tpr -v -gpu_id 00
>
> Best,
> Carsten
>
>>
>> Cheers,
>>
>> Berk
>>
>> On 11/08/2013 02:30 PM, Carsten Kutzner wrote:
>>> Hi,
>>>
>>> using a just checked-out 4.6 branch compiled with debug checks I get
>>>
>>> -------------------------------------------------------
>>> Program mdrun, VERSION 4.6.4-dev-20131107-ba8232e
>>> Source code file: 
>>> /home/ckutzne/junoworkspace/git-gromacs-vanilla/src/mdlib/nbnxn_search.c, 
>>> line: 609
>>>
>>> Fatal error:
>>> (int)((x[74522][x]=11.764535 - 10.229600)*58.394176) = 89, not in 0 
>>> - 16*4
>>>
>>> For more information and tips for troubleshooting, please check the 
>>> GROMACS
>>> website at http://www.gromacs.org/Documentation/Errors
>>> -------------------------------------------------------
>>>
>>> Carsten
>>>
>>>
>>> On 11/08/2013 02:00 PM, Berk Hess wrote:
>>>> On 11/08/2013 01:44 PM, Mark Abraham wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Nov 8, 2013 at 12:58 PM, Carsten Kutzner <ckutzne at gwdg.de 
>>>>> <mailto:ckutzne at gwdg.de>> wrote:
>>>>>
>>>>>     Hi Mark, hi Berk,
>>>>>
>>>>>     On Nov 7, 2013, at 6:48 PM, Berk Hess <hess at kth.se
>>>>>     <mailto:hess at kth.se>> wrote:
>>>>>
>>>>>     > Hi Carsten,
>>>>>     >
>>>>>     > After how many steps does this happen?
>>>>>     this happens immediately at startup.
>>>>>
>>>>>     > Could you run with a debug build (or without NDEBUG defined)?
>>>>>     > I added a lot of checks, not done with NDEBUG, in the fix
>>>>>     for the issue you linked.
>>>>>     Will do that now.
>>>>>
>>>>>     > On 11/07/2013 06:27 PM, Mark Abraham wrote:
>>>>>     >> Unclear. 6583c94 is one of your commits. Some very recent
>>>>>     stuff has been playing with nstlist and rlist (safely, or so
>>>>>     we thought.) Can you reproduce with mainstream release-4-6?
>>>>>     This is basically mainstream 4-6, since in my commit I only
>>>>>     changed the default behavior of
>>>>>     appending to no.
>>>>>
>>>>>
>>>>> Right. What's the mainstream parent commit? I was going to release 
>>>>> 4.6.4 today - if you're based off the current tip then maybe we 
>>>>> shouldn't. If you're based off code a month back then we know the 
>>>>> problem, if any, is of longer standing.
>>>> This is 4.6.4-dev which seems to include my fix for the previous 
>>>> issue, so this issue is surely present in the current 4-6-release 
>>>> branch. It must be due to a somewhat exotic condition, since this 
>>>> code is widely used and we haven't had other reports.
>>>>
>>>> I think it should be easy to track this down with all the debug 
>>>> checks in the code.
>>>> And if Carsten can send me his system and the conditions to 
>>>> reproduce it, I can also help with debugging.
>>>>
>>>> Cheers,
>>>>
>>>> Berk
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>>     Carsten
>>>>>
>>>>>     >>
>>>>>     >> Mark
>>>>>     >>
>>>>>     >>
>>>>>     >> On Thu, Nov 7, 2013 at 5:18 PM, Carsten Kutzner
>>>>>     <ckutzne at gwdg.de <mailto:ckutzne at gwdg.de>> wrote:
>>>>>     >> Hi,
>>>>>     >>
>>>>>     >> we have a 120k atom system that crashes with
>>>>>     >>
>>>>>     >> ------------------------------------------------------
>>>>>     >> Program mdrun_mpi, VERSION 4.6.4-dev-20131015-6583c94
>>>>>     >> Source code file: /home/c/gromacs/src/mdlib/nbnxn_search.c,
>>>>>     line: 685
>>>>>     >>
>>>>>     >> Software inconsistency error:
>>>>>     >> Lost particles while sorting
>>>>>     >> For more information and tips for troubleshooting, please
>>>>>     check the GROMACS
>>>>>     >> website at http://www.gromacs.org/Documentation/Errors
>>>>>     >> -------------------------------------------------------
>>>>>     >>
>>>>>     >> if run with >= 2 MPI processes on a GPU and small values
>>>>>     for nstlist. On my workstation,
>>>>>     >> nstlist = 34 and larger works, whereas nstlist <= 33 leads
>>>>>     to the above problem.
>>>>>     >>
>>>>>     >> Another system (60k atoms) does not produce this problem,
>>>>>     so system size seems
>>>>>     >> to matter as well.
>>>>>     >>
>>>>>     >> Looks like an old ghost:
>>>>>     >>
>>>>>     >> http://redmine.gromacs.org/issues/1153
>>>>>     >>
>>>>>     >>
>>>>>     >> Should I file a redmine issue?
>>>>>     >>
>>>>>     >> Carsten
>>>>>     >>
>>>>>     >>
>>>>>     >> --
>>>>>     >> gmx-developers mailing list
>>>>>     >> gmx-developers at gromacs.org <mailto:gmx-developers at gromacs.org>
>>>>>     >> http://lists.gromacs.org/mailman/listinfo/gmx-developers
>>>>>     >> Please don't post (un)subscribe requests to the list. Use
>>>>>     the www interface or send it to
>>>>>     gmx-developers-request at gromacs.org
>>>>>     <mailto:gmx-developers-request at gromacs.org>.
