[gmx-developers] PME tuning-related hang [was Re: Gromacs 2016.3 (and earlier) freezing up.]

John Eblen jeblen at acm.org
Fri Sep 15 20:25:39 CEST 2017


This issue appears not to be a GROMACS problem so much as a problem with
"huge pages" that is triggered by PME tuning. PME tuning creates a large
data structure for every cutoff it tries, which is replicated on each PME
node. These data structures are not freed during tuning, so memory usage
grows with the number of cutoffs attempted. Normally the extra memory is
still too small to cause problems. With huge pages, however, I get errors
from "libhugetlbfs" and very slow runs if more than about five cutoffs are
attempted.
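
To make the growth pattern concrete, here is a minimal C++ sketch of the
behaviour described above (purely illustrative, not the actual GROMACS code;
the struct and variable names are invented, and every PME node would hold its
own copy of the accumulated setups):

    #include <cstddef>
    #include <vector>

    // One trial setup is kept per cutoff attempted during PME tuning
    // (names invented for illustration).
    struct PmeTrialSetup
    {
        double              cutoff;  // Coulomb cutoff tried at this step
        std::vector<double> grid;    // grid-sized buffer, nx*ny*nz values
    };

    int main()
    {
        // Nothing is released until tuning finishes, so resident memory
        // grows with every cutoff that gets timed, and every PME node
        // holds its own copy.
        std::vector<PmeTrialSetup> trials;

        const int    dims[5][3] = { {128, 128, 128}, {112, 112, 112},
                                    {100, 100, 100}, { 84,  84,  84},
                                    { 96,  96,  96} };
        const double cutoffs[5] = { 1.200, 1.336, 1.496, 1.781, 1.559 };

        for (int i = 0; i < 5; ++i)
        {
            const std::size_t n = static_cast<std::size_t>(dims[i][0])
                                  * dims[i][1] * dims[i][2];
            // Roughly 16 MB for a 128^3 grid of doubles, kept alive per trial.
            trials.push_back({ cutoffs[i], std::vector<double>(n) });
        }

        // With ordinary 4 kB pages this is merely wasteful; with huge pages
        // the extra heap segments can fail to map, as in the libhugetlbfs
        // warnings in the sample output below.
        return trials.size() == 5 ? 0 : 1;
    }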

Sample output on NERSC Cori KNL with 32 nodes. Input system size is 248,101
atoms.

step 0
step 100, remaining wall clock time:    24 s
step  140: timed with pme grid 128 128 128, coulomb cutoff 1.200: 66.2 M-cycles
step  210: timed with pme grid 112 112 112, coulomb cutoff 1.336: 69.6 M-cycles
step  280: timed with pme grid 100 100 100, coulomb cutoff 1.496: 63.6 M-cycles
step  350: timed with pme grid 84 84 84, coulomb cutoff 1.781: 85.9 M-cycles
step  420: timed with pme grid 96 96 96, coulomb cutoff 1.559: 68.8 M-cycles
step  490: timed with pme grid 100 100 100, coulomb cutoff 1.496: 68.3 M-cycles
libhugetlbfs [nid08887:140420]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory
libhugetlbfs [nid08881:97968]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory
libhugetlbfs [nid08881:97978]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory
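
For what it's worth, the "Cannot allocate memory" in those warnings is what a
huge-page-backed mapping returns once the node's huge page pool is exhausted.
A minimal Linux reproduction of that failure mode (a sketch only, not
libhugetlbfs's actual code path) looks like this:

    #include <cerrno>
    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <sys/mman.h>

    int main()
    {
        // Ask for a 1 GB anonymous mapping backed by huge pages. Once the
        // node's huge page pool is used up, the kernel refuses with ENOMEM,
        // which is the "Cannot allocate memory" seen in the warnings above.
        const std::size_t len = std::size_t(1) << 30;
        void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)
        {
            std::printf("huge page mmap failed: %s\n", std::strerror(errno));
            return 1;
        }
        munmap(p, len);
        return 0;
    }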

Szilárd, to answer your questions: This is the Verlet scheme. The problem
happens during tuning, and no problems occur if -notunepme is used. In fact,
the best performance so far has been with 50% PME nodes, huge pages, and
'-notunepme'.
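
For reference, the kind of invocation I mean is along these lines (a sketch,
not the exact job script; the launcher, rank counts, and file names are
placeholders, while -npme and -notunepme are the actual mdrun options):

    # 32 KNL nodes, half of the MPI ranks dedicated to PME, tuning disabled
    # (rank counts and file names are placeholders):
    srun -n 128 gmx_mpi mdrun -s topol.tpr -npme 64 -notunepme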


John

On Wed, Sep 13, 2017 at 6:20 AM, Szilárd Páll <pall.szilard at gmail.com>
wrote:

> Forking the discussion, as we've now learned more about the issue Åke
> is reporting and it is rather dissimilar.
>
> On Mon, Sep 11, 2017 at 8:09 PM, John Eblen <jeblen at acm.org> wrote:
> > Hi Szilárd
> >
> > No, I'm not using the group scheme.
>
>  $ grep -i 'cutoff-scheme' md.log
>    cutoff-scheme                  = Verlet
>
> > The problem seems similar because:
> >
> > 1) Deadlocks and very slow runs can be hard to distinguish.
> > 2) Since Mark mentioned it, I assume he believes PME tuning is a possible
> >     cause, which is also the cause in my situation.
>
> Does that mean you tested with "-notunepme" and the excessive memory
> usage could not be reproduced? Did the memory usage increase only
> during the tuning or did it keep increasing after the tuning
> completed?
>
> > 3) Åke may be experiencing higher-than-normal memory usage as far as I know.
> >     Not sure how you know otherwise.
> > 4) By "successful," I assume you mean the tuning had completed. That doesn't
> >     mean, though, that the tuning could not be creating conditions that
> >     cause the problem, like an excessively high cutoff.
>
> Sure. However, it's unlikely that the tuning creates conditions under
> which the run proceeds after the initial tuning phase and keeps
> allocating memory (which is the more likely source of such issues).
>
> I suggest first ruling out the bug I linked; if that's not the
> culprit, we can have a closer look.
>
> Cheers,
> --
> Szilárd
>
> >
> >
> > John
> >
> > On Mon, Sep 11, 2017 at 1:09 PM, Szilárd Páll <pall.szilard at gmail.com>
> > wrote:
> >>
> >> John,
> >>
> >> In what way do you think your problem is similar? Åke seems to be
> >> experiencing a deadlock after successful PME tuning, much later during
> >> the run, but no excessive memory usage.
> >>
> >> Do you happen to be using the group scheme with 2016.x (release code)?
> >>
> >> Your issue sounds more like it could be related to the excessive
> >> tuning bug with the group scheme that was fixed quite a few months ago
> >> but has yet to be released (https://redmine.gromacs.org/issues/2200).
> >>
> >> Cheers,
> >> --
> >> Szilárd
> >>
> >>
> >> On Mon, Sep 11, 2017 at 6:50 PM, John Eblen <jeblen at acm.org> wrote:
> >> > Hi
> >> >
> >> > I'm having a similar problem that is related to PME tuning. When it is
> >> > enabled, GROMACS often, but not always, slows to a crawl and uses
> >> > excessive amounts of memory. Using "huge pages" and setting a high
> >> > number of PME processes seems to exacerbate the problem.
> >> >
> >> > Also, occurrences of this problem seem to correlate with how high the
> >> > tuning raises the cutoff value.
> >> >
> >> > Mark, can you give us more information on the problems with PME
> >> > tuning? Is there a Redmine issue?
> >> >
> >> >
> >> > Thanks
> >> > John
> >> >
> >> > On Mon, Sep 11, 2017 at 10:53 AM, Mark Abraham
> >> > <mark.j.abraham at gmail.com>
> >> > wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> Thanks. Was PME tuning active? Does it reproduce if that is disabled?
> >> >> Is the PME tuning still active? How many steps have taken place (at
> >> >> least as reported in the log file, but ideally from the processes)?
> >> >>
> >> >> Mark
> >> >>
> >> >> On Mon, Sep 11, 2017 at 4:42 PM Åke Sandgren
> >> >> <ake.sandgren at hpc2n.umu.se>
> >> >> wrote:
> >> >>>
> >> >>> My debugger run finally got to the lockup.
> >> >>>
> >> >>> All processes are waiting on various MPI operations.
> >> >>>
> >> >>> Attached a stack dump of all 56 tasks.
> >> >>>
> >> >>> I'll keep the debug session running for a while in case anyone wants
> >> >>> some more detailed data.
> >> >>> This is a RelWithDebInfo build, though, so not everything is available.
> >> >>>
> >> >>> On 09/08/2017 11:28 AM, Berk Hess wrote:
> >> >>> > But you should be able to get some (limited) information by
> >> >>> > attaching a debugger to an already running process with a release
> >> >>> > build.
> >> >>> >
> >> >>> > If you plan on compiling and running a new case, use a release +
> >> >>> > debug
> >> >>> > symbols build. That should run as fast as a release build.
> >> >>> >
> >> >>> > Cheers,
> >> >>> >
> >> >>> > Berk
> >> >>> >
> >> >>> > On 2017-09-08 11:23, Åke Sandgren wrote:
> >> >>> >> We have at least one case that, when run over two or more nodes,
> >> >>> >> quite often (in fact always) hangs: no more output appears in md.log
> >> >>> >> or elsewhere while mdrun still consumes CPU time. It takes a random
> >> >>> >> amount of time before it happens, on the order of 1-3 days.
> >> >>> >>
> >> >>> >> The case can be shared if someone else wants to investigate. I'm
> >> >>> >> planning to run it in the debugger so I can break and look at the
> >> >>> >> state when it happens, but since it takes so long with the production
> >> >>> >> build, it is not something I'm looking forward to.
> >> >>> >>
> >> >>> >> On 09/08/2017 11:13 AM, Berk Hess wrote:
> >> >>> >>> Hi,
> >> >>> >>>
> >> >>> >>> We are far behind schedule for the 2017 release. We are working
> >> >>> >>> hard
> >> >>> >>> on
> >> >>> >>> it, but I don't think we can promise a date yet.
> >> >>> >>>
> >> >>> >>> We have a 2016.4 release planned for this week (might slip to next
> >> >>> >>> week). But if you can give us enough details to track down your
> >> >>> >>> hanging issue, we might be able to fix it in 2016.4.
> >> >>> >
> >> >>>
> >> >>> --
> >> >>> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> >> >>> Internet: ake at hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
> >> >>> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se