[gmx-users] FW: FW: Issue with CUDA and gromacs

Szilárd Páll pall.szilard at gmail.com
Tue Apr 9 21:07:56 CEST 2019


Hi,

One more test: I realized it may be relevant, considering that we had a
similar report earlier this year on similar CPU hardware:
can you please compile with -DGMX_SIMD=AVX2_256 and rerun the tests?
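For reference, the reconfigure could look something like this (assuming an
existing build directory and keeping the rest of your original CMake
options unchanged; paths are examples):

  # switch to the narrower AVX2 SIMD kernels, rebuild, and rerun the tests
  cd gromacs/build
  cmake . -DGMX_SIMD=AVX2_256
  make -j 16
  make check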

--
Szilárd


On Tue, Apr 9, 2019 at 8:35 PM Szilárd Páll <pall.szilard at gmail.com> wrote:

> Dear Stefanie,
>
> On Fri, Apr 5, 2019 at 11:48 AM Tafelmeier, Stefanie <
> Stefanie.Tafelmeier at zae-bayern.de> wrote:
>
>> Hi Szilárd,
>>
>> thanks for your advice.
>> I performed the tests.
>> Both completed without errors.
>>
>
> OK, that excludes simple and obvious issues.
> Wild guess, but can you run those again, this time prefixing the command
> with
> "taskset -c 22-32"
> ? This makes the tests use cores 22-32, just to check whether using a
> specific set of cores somehow triggers an error.
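>
> For example (substitute the actual test command you ran before; the
> binary name here is only a placeholder):
>
>   # pin the process to cores 22-32 and repeat the same test
>   taskset -c 22-32 ./your-test-command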
>
> What CUDA version did you use to compile the memtest tool -- was it the
> same (CUDA 9.2) as the one used for building GROMACS?
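>
> (To see which toolkit a given nvcc belongs to, you can run, e.g.:
>
>   nvcc --version
>
> in the same environment you compiled the tool in.)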
>
>> Just to get it right, I have to ask in more detail, because the connection
>> between the CPU/GPU and the distribution of the calculations is still a bit
>> blurry to me:
>>
>> If the output of the regressiontests shows that a test crashes after 1-2
>> steps, does this mean there is an issue with the transfer between the CPU
>> and GPU?
>> As far as I understand, the short-range calculation part is normally split
>> into nonbonded -> GPU and bonded -> CPU?
>>
>
> The -nb/-pme/-bonded flags control which task executes where (if not
> specified, defaults control this); the output contains a report which
> summarizes where the major force tasks are executed. E.g., the following is
> from one of your log files and tells us that the PP tasks (i.e. particle
> tasks like the short-range nonbonded interactions) and the full PME task
> are offloaded to the GPU with ID 0 (to check which GPU that is, you can
> look at the "Hardware detection" section of the log):
>
> 1 GPU selected for this run.
> Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
>   PP:0,PME:0
> PP tasks will do (non-perturbed) short-ranged interactions on the GPU
> PME tasks will do all aspects on the GPU
>
> For more details, please see
> http://manual.gromacs.org/documentation/2019.1/user-guide/mdrun-performance.html#running-mdrun-with-gpus
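>
> As a concrete illustration, an explicit assignment equivalent to the
> mapping above could be requested like this (the .tpr file name is just a
> placeholder):
>
>   # offload short-range nonbonded and PME to GPU 0; keep bonded on the CPU
>   gmx mdrun -s topol.tpr -nb gpu -pme gpu -bonded cpu -gpu_id 0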
>
> We have seen two types of errors so far:
> - "Asynchronous H2D copy failed: invalid argument", which is still
> mysterious to me and has shown up both in your repeated manual runs and in
> the regressiontests; this one aborts the run.
> - Failing regressiontests with either invalid results or crashes (besides
> the above abort): to be honest, I do not know what causes these.
>
> The latter errors indicate incorrect results: in your last "complex" tests
> tarball I saw some tests failing with LINCS errors (and indicating NaN
> values) and a good fraction of tests failing with GPU-side assertions --
> both of which suggest that things do go wrong on the GPU.
>
>> And does this mean that maybe the calculations I do also have wrong
>> energies? Can I trust my results?
>>
>
> At this point I can unfortunately not recommend running production
> simulations on this machine.
>
> I will try to continue exploring the possible errors, and I hope you can
> help out with some tests:
>
> - Please run the complex regressiontests (using the RelWithAssert binary)
> with the CUDA_LAUNCH_BLOCKING environment variable set. This may allow us
> to reason better about the source of the errors. Also, you can reconfigure
> with cmake -DGMX_OPENMP_MAX_THREADS=128 to avoid the 88-OpenMP-thread
> errors that you encountered yourself in some tests.
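>
> For example, something along these lines (the directory name is a
> placeholder for wherever you unpacked the regressiontests suite):
>
>   # serialize CUDA kernel launches so failures surface at their origin
>   export CUDA_LAUNCH_BLOCKING=1
>   cd regressiontests/
>   perl gmxtest.pl complex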
>
> - Can you please recompile GROMACS with CUDA 10 and check whether either
> of the two kinds of errors reproduces? (If it does, and if you can upgrade
> the driver, I suggest upgrading to CUDA 10.1.)
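>
> A minimal sketch of pointing the build at a different toolkit (the CUDA
> install path is an assumption; adjust it to your system):
>
>   cd gromacs/build
>   cmake . -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.0
>   make -j 16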
>
>
>
>>
>> Many thanks again for your support.
>> Best wishes,
>> Steffi
>>
>>
> --
> Szilárd
>

