[gmx-users] WG: WG: Issue with CUDA and gromacs

Jonathan Vincent jvincent at nvidia.com
Tue Apr 9 22:12:44 CEST 2019


Which operating system are you running? We have seen some strange behavior with a large number of threads, gcc 7.3, and a newish version of glibc; specifically the default combination that ships with Ubuntu 18.04 LTS, but it might be more generic than that. 

My suggestion would be to update to gcc 8.3 and CUDA 10.1 (which is required for CUDA to support gcc 8); that seemed to fix the problem in that case.
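In case it is useful, a (re)configure along these lines should pick up the newer toolchain; the compiler names, CUDA install path, and build-directory layout below are assumptions for a typical setup, not exact instructions:

```shell
# From a clean GROMACS build directory; adjust paths to your install.
CC=gcc-8 CXX=g++-8 cmake .. \
    -DGMX_GPU=ON \
    -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.1
make -j
make check    # rerun the tests against the new build
```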

If you still have problems we can look at this some more.


-----Original Message-----
From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> On Behalf Of Szilárd Páll
Sent: 09 April 2019 20:08
To: Discussion list for GROMACS users <gmx-users at gromacs.org>
Subject: Re: [gmx-users] WG: WG: Issue with CUDA and gromacs


One more test: I realized it may be relevant, considering that we had a similar report earlier this year on similar CPU hardware. Can you please compile with -DGMX_SIMD=AVX2_256 and rerun the tests?
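For instance (a sketch, assuming you reconfigure in an existing build directory):

```shell
# Force 256-bit AVX2 SIMD kernels instead of the auto-detected width,
# then rebuild and rerun the tests.
cmake .. -DGMX_SIMD=AVX2_256
make -j
make check
```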


On Tue, Apr 9, 2019 at 8:35 PM Szilárd Páll <pall.szilard at gmail.com> wrote:

> Dear Stefanie,
> On Fri, Apr 5, 2019 at 11:48 AM Tafelmeier, Stefanie < 
> Stefanie.Tafelmeier at zae-bayern.de> wrote:
>> Hi Szilárd,
>> thanks for your advices.
>> I performed the tests.
>> Both performed without errors.
> OK, that excludes simple and obvious issues.
> Wild guess, but can you run those again, but this time prefix the 
> command with "taskset -c 22-32"
> ? This makes the tests use cores 22-32 just to check if using a 
> specific set of cores may somehow trigger an error.
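Something like the following, where ./gpu_memtest stands in for whichever test binary you ran before (a placeholder, not an actual tool name):

```shell
# Pin the process (and all its threads) to logical cores 22-32 only:
taskset -c 22-32 ./gpu_memtest
```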
> What CUDA version did you use to compile the memtest tool -- was it 
> the same (CUDA 9.2) as the one used for building GROMACS?
>> Just to get it right, I have to ask in more detail, because the 
>> connection between the CPU/GPU and the calculation distribution is 
>> still a bit blurry to me:
>> If the output of the regressiontests shows that a test crashes after 
>> 1-2 steps, does this mean there is an issue with the transfer between 
>> the CPU and the GPU?
>> As far as I understood, the short-range calculation part is normally 
>> split into nonbonded -> GPU and bonded -> CPU?
> The -nb/-pme/-bonded flags control which task executes where (if not 
> specified, defaults control this); the output contains a report which 
> summarizes where the major force tasks are executed. E.g. this excerpt 
> from one of your log files tells you that PP (i.e. particle tasks like 
> short-range nonbonded) and the full PME task are offloaded to the GPU 
> with ID 0 (to check which GPU that is, you can look at the "Hardware 
> detection" section of the log):
> 1 GPU selected for this run.
> Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
>   PP:0,PME:0
> PP tasks will do (non-perturbed) short-ranged interactions on the GPU 
> PME tasks will do all aspects on the GPU
> For more details, please see
> http://manual.gromacs.org/documentation/2019.1/user-guide/mdrun-performance.html#running-mdrun-with-gpus
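As a concrete sketch, that mapping corresponds to an invocation along these lines (the .tpr file name is a placeholder):

```shell
# Offload short-range nonbonded (PP) and the full PME task to GPU 0:
gmx mdrun -s topol.tpr -nb gpu -pme gpu -gpu_id 0
```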
> We have seen two types of errors so far:
> - "Asynchronous H2D copy failed: invalid argument", which is still 
> mysterious to me and has shown up both in your repeated manual runs 
> and in the regressiontests; this one aborts the run.
> - Failing regressiontests with either invalid results or crashes 
> (besides the abort above); to be honest, I do not know what causes these.
> The latter errors indicate incorrect results: in your last "complex" 
> tests tarball I saw some tests failing with LINCS errors (and 
> indicating NaN values), and a good fraction of tests failing with 
> GPU-side assertions, both of which suggest that things do go wrong on 
> the GPU.
>> And does this mean that maybe also the calculations I do have wrong
>> energies? Can I trust my results?
> At this point I can unfortunately not recommend running production 
> simulations on this machine.
> I will try to continue exploring the possible errors, and I hope you can 
> help out with some tests:
> - Please run the complex regressiontests (using the RelWithAssert 
> binary) with the CUDA_LAUNCH_BLOCKING environment variable set. This 
> may allow us to reason better about the source of the errors. You can 
> also reconfigure with cmake -DGMX_OPENMP_MAX_THREADS=128 to avoid the 
> 88-OpenMP-thread errors in tests that you encountered yourself.
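A sketch of both steps, assuming the regressiontests are driven by the gmxtest.pl script from your regressiontests checkout:

```shell
# Serialize GPU kernel launches so failures surface at the offending call:
export CUDA_LAUNCH_BLOCKING=1
perl gmxtest.pl complex

# Reconfigure from the build directory to raise the OpenMP thread limit:
cmake .. -DGMX_OPENMP_MAX_THREADS=128
make -j
```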
> - Can you please recompile GROMACS with CUDA 10 and check whether 
> either of the two kinds of errors reproduces? (If it does, and if you 
> can upgrade the driver, I suggest upgrading to CUDA 10.1.)
>> Many thanks again for your support.
>> Best wishes,
>> Steffi
> --
> Szilárd
Gromacs Users mailing list

* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.

