[gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

Szilárd Páll pall.szilard at gmail.com
Thu Feb 8 15:27:32 CET 2018


BTW, timeouts can also be caused by contention from an excessive number of
ranks/tMPI threads hammering a single GPU (especially with 2 threads per core
with HT), but I'm not sure the tests are ever executed with such a huge rank count.
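
One way to rule that out is to make sure the GPU is otherwise idle and then
rerun just the two timed-out tests with a modest thread count. A rough sketch
(test names taken from the CTest summary quoted below; adjust the build path
and thread count to your machine):

$ nvidia-smi                      # is anything else using the GPU?
$ cd /path/to/gromacs-2018/build
$ OMP_NUM_THREADS=2 ctest -R 'GpuUtilsUnitTests|MdrunTests' \
      --output-on-failure --timeout 600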

--
Szilárd

On Thu, Feb 8, 2018 at 2:40 PM, Mark Abraham <mark.j.abraham at gmail.com>
wrote:

> Hi,
>
> On Thu, Feb 8, 2018 at 2:15 PM Alex <nedomacho at gmail.com> wrote:
>
> > Mark and Peter,
> >
> > Thanks for commenting. I was told that all CUDA tests passed, but I will
> > double-check how many of them were actually run. Also, we never rebooted
> > the box after the CUDA install, and finally we had a bunch of GROMACS
> > (2016.4) jobs running, because we didn't want to interrupt the postdoc's
> > work... All of those were with -nb cpu, though. Could those factors have
> > affected our regression tests?
> >
>
> Can't say. You observed timeouts, which could be consistent with drivers or
> runtimes getting stuck. However, the other mdrun processes may have set
> thread affinity by default, and any process that does that will interfere
> with how effectively other processes, such as the tests, can run. Sharing a
> node is difficult to do well, and doing anything else with a node running
> GROMACS is asking for trouble unless you have manually kept the tasks
> apart. Just don't.
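>
> If you really must share the node, one way to keep the tasks apart manually
> is to give each job its own cores and disable mdrun's automatic pinning for
> the background jobs, roughly like this (core ranges are only an example;
> pick them to match your machine):
>
> $ taskset -c 0-7  gmx mdrun -nb cpu -pin off ...   # postdoc's jobs
> $ taskset -c 8-15 make check                       # the test run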
>
> Mark
>
>
> > It will really suck if these are hardware-related...
> >
> > Thanks,
> >
> > Alex
> >
> >
> > On 2/8/2018 3:03 AM, Mark Abraham wrote:
> > > Hi,
> > >
> > > Or leftovers of the drivers that are now mismatching. That has caused
> > > timeouts for us.
> > >
> > > Mark
> > >
> > > On Thu, Feb 8, 2018 at 10:55 AM Peter Kroon <p.c.kroon at rug.nl> wrote:
> > >
> > >> Hi,
> > >>
> > >>
> > >> With changing failures like this I would start to suspect the hardware
> > >> as well. Mark's suggestion of looking at simpler test programs than GMX
> > >> is a good one :)
> > >>
> > >>
> > >> Peter
> > >>
> > >>
> > >> On 08-02-18 09:10, Mark Abraham wrote:
> > >>> Hi,
> > >>>
> > >>> That suggests that your new CUDA installation is differently
> > >>> incomplete. Do its samples or test programs run?
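> > >>>
> > >>> For example, building and running a couple of the bundled CUDA samples
> > >>> exercises both the driver and cuFFT, the library behind the cufftPlanMany
> > >>> error in the subject line. A rough sketch, assuming a default toolkit
> > >>> install under /usr/local/cuda (you may need to copy the samples somewhere
> > >>> writable before building them):
> > >>>
> > >>> $ nvidia-smi   # driver loaded and GPUs visible?
> > >>> $ cd /usr/local/cuda/samples/1_Utilities/deviceQuery && make && ./deviceQuery
> > >>> $ cd /usr/local/cuda/samples/7_CUDALibraries/simpleCUFFT && make && ./simpleCUFFT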
> > >>>
> > >>> Mark
> > >>>
> > >>> On Thu, Feb 8, 2018 at 1:20 AM Alex <nedomacho at gmail.com> wrote:
> > >>>
> > >>>> Update: we seem to have had a hiccup with an orphan CUDA install, and
> > >>>> that was causing issues. After wiping everything off and rebuilding, the
> > >>>> errors from the initial post disappeared. However, two tests failed
> > >>>> during the regression run:
> > >>>>
> > >>>> 95% tests passed, 2 tests failed out of 39
> > >>>>
> > >>>> Label Time Summary:
> > >>>> GTest              = 170.83 sec (33 tests)
> > >>>> IntegrationTest    = 125.00 sec (3 tests)
> > >>>> MpiTest            =   4.90 sec (3 tests)
> > >>>> UnitTest           =  45.83 sec (30 tests)
> > >>>>
> > >>>> Total Test time (real) = 1225.65 sec
> > >>>>
> > >>>> The following tests FAILED:
> > >>>>    9 - GpuUtilsUnitTests (Timeout)
> > >>>> 32 - MdrunTests (Timeout)
> > >>>> Errors while running CTest
> > >>>> CMakeFiles/run-ctest-nophys.dir/build.make:57: recipe for target
> > >>>> 'CMakeFiles/run-ctest-nophys' failed
> > >>>> make[3]: *** [CMakeFiles/run-ctest-nophys] Error 8
> > >>>> CMakeFiles/Makefile2:1160: recipe for target
> > >>>> 'CMakeFiles/run-ctest-nophys.dir/all' failed
> > >>>> make[2]: *** [CMakeFiles/run-ctest-nophys.dir/all] Error 2
> > >>>> CMakeFiles/Makefile2:971: recipe for target 'CMakeFiles/check.dir/rule'
> > >>>> failed
> > >>>> make[1]: *** [CMakeFiles/check.dir/rule] Error 2
> > >>>> Makefile:546: recipe for target 'check' failed
> > >>>> make: *** [check] Error 2
> > >>>>
> > >>>> Any ideas? I can post the complete log, if needed.
> > >>>>
> > >>>> Thank you,
> > >>>>
> > >>>> Alex