[gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

Alex nedomacho at gmail.com
Thu Feb 8 15:37:29 CET 2018


I am rebooting the box and kicking out all the jobs until we figure this 
out.

Thanks!

Alex


On 2/8/2018 7:27 AM, Szilárd Páll wrote:
> BTW, timeouts can be caused by contention from an excessive number of
> ranks/tMPI threads hammering a single GPU (especially with 2 threads/core
> with HT), but I'm not sure whether the tests are ever executed with such a
> huge rank count.
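To illustrate the oversubscription point above, here is a sketch (thread counts are purely illustrative; `-ntmpi`, `-ntomp`, and the ctest flags are standard options, but the exact invocation for this box is an assumption):

```shell
# Cap the rank/thread count instead of letting mdrun spawn one thread per
# hardware thread, so a single GPU is not hammered by dozens of threads:
gmx mdrun -ntmpi 1 -ntomp 4 -deffnm topol

# A failing regression-test group can be rerun on its own via ctest to see
# whether the timeout persists with less contention:
ctest -R GpuUtilsUnitTests --output-on-failure
```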
>
> --
> Szilárd
>
> On Thu, Feb 8, 2018 at 2:40 PM, Mark Abraham <mark.j.abraham at gmail.com>
> wrote:
>
>> Hi,
>>
>> On Thu, Feb 8, 2018 at 2:15 PM Alex <nedomacho at gmail.com> wrote:
>>
>>> Mark and Peter,
>>>
>>> Thanks for commenting. I was told that all CUDA tests passed, but I will
>>> double-check how many of those were actually run. Also, we never
>>> rebooted the box after the CUDA install, and finally we had a bunch of
>>> GROMACS (2016.4) jobs running because we didn't want to interrupt a
>>> postdoc's work... All of those were with -nb cpu, though. Could those
>>> factors have affected our regression tests?
>>>
>> Can't say. You observed timeouts, which could be consistent with drivers or
>> runtimes getting stuck. However, the other mdrun processes may have set
>> thread affinity by default, and any process that does so will interfere
>> with how effectively any others run, including the tests. Sharing a node is
>> difficult to do well, and doing anything else with a node running GROMACS
>> is asking for trouble unless you have manually kept the tasks
>> apart. Just don't.
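As a sketch of what "manually keeping the tasks apart" could look like (core counts are illustrative; `-pin`, `-pinoffset`, and `-pinstride` are standard mdrun options, but the split below is an assumed example, not a recommendation for this machine):

```shell
# Two mdrun jobs sharing one node, pinned to disjoint core ranges so their
# thread affinities cannot collide:
# Job 1: cores 0-7
gmx mdrun -ntomp 8 -pin on -pinoffset 0 -pinstride 1 -deffnm run1 &
# Job 2: cores 8-15
gmx mdrun -ntomp 8 -pin on -pinoffset 8 -pinstride 1 -deffnm run2 &
wait
```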
>>
>> Mark
>>
>>
>>> It will really suck if these are hardware-related...
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>>
>>> On 2/8/2018 3:03 AM, Mark Abraham wrote:
>>>> Hi,
>>>>
>>>> Or leftovers of the drivers that are now mismatching. That has caused
>>>> timeouts for us.
>>>>
>>>> Mark
>>>>
>>>> On Thu, Feb 8, 2018 at 10:55 AM Peter Kroon <p.c.kroon at rug.nl> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> With changing failures like this I would start to suspect the hardware
>>>>> as well. Mark's suggestion of looking at simpler test programs than
>>>>> GMX is a good one :)
>>>>>
>>>>>
>>>>> Peter
>>>>>
>>>>>
>>>>> On 08-02-18 09:10, Mark Abraham wrote:
>>>>>> Hi,
>>>>>>
>>>>>> That suggests that your new CUDA installation is differently
>>>>>> incomplete. Do its samples or test programs run?
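One way to check that could look like the following sketch (the samples path is typical of a CUDA 8/9-era install and may differ on this system):

```shell
# Verify the CUDA toolkit itself before blaming GROMACS.
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make            # build the sample in place
./deviceQuery        # should list the GPU and end with "Result = PASS"
nvidia-smi           # driver/runtime sanity check
```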
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On Thu, Feb 8, 2018 at 1:20 AM Alex <nedomacho at gmail.com> wrote:
>>>>>>
>>>>>>> Update: we seem to have had a hiccup with an orphan CUDA install that
>>>>>>> was causing issues. After wiping everything off and rebuilding, the
>>>>>>> errors from the initial post disappeared. However, two tests failed
>>>>>>> during regression:
>>>>>>>
>>>>>>> 95% tests passed, 2 tests failed out of 39
>>>>>>>
>>>>>>> Label Time Summary:
>>>>>>> GTest              = 170.83 sec (33 tests)
>>>>>>> IntegrationTest    = 125.00 sec (3 tests)
>>>>>>> MpiTest            =   4.90 sec (3 tests)
>>>>>>> UnitTest           =  45.83 sec (30 tests)
>>>>>>>
>>>>>>> Total Test time (real) = 1225.65 sec
>>>>>>>
>>>>>>> The following tests FAILED:
>>>>>>>     9 - GpuUtilsUnitTests (Timeout)
>>>>>>> 32 - MdrunTests (Timeout)
>>>>>>> Errors while running CTest
>>>>>>> CMakeFiles/run-ctest-nophys.dir/build.make:57: recipe for target
>>>>>>> 'CMakeFiles/run-ctest-nophys' failed
>>>>>>> make[3]: *** [CMakeFiles/run-ctest-nophys] Error 8
>>>>>>> CMakeFiles/Makefile2:1160: recipe for target
>>>>>>> 'CMakeFiles/run-ctest-nophys.dir/all' failed
>>>>>>> make[2]: *** [CMakeFiles/run-ctest-nophys.dir/all] Error 2
>>>>>>> CMakeFiles/Makefile2:971: recipe for target 'CMakeFiles/check.dir/rule'
>>>>>>> failed
>>>>>>> make[1]: *** [CMakeFiles/check.dir/rule] Error 2
>>>>>>> Makefile:546: recipe for target 'check' failed
>>>>>>> make: *** [check] Error 2
>>>>>>>
>>>>>>> Any ideas? I can post the complete log, if needed.
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> Alex
>>>>>>> --
>>>>>>> Gromacs Users mailing list
>>>>>>>
>>>>>>> * Please search the archive at
>>>>>>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
>>>>>>> posting!
>>>>>>>
>>>>>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>>>>>
>>>>>>> * For (un)subscribe requests visit
>>>>>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>>>>>>> send a mail to gmx-users-request at gromacs.org.
>>>>>>>
