[gmx-users] WG: WG: Issue with CUDA and gromacs

Tafelmeier, Stefanie Stefanie.Tafelmeier at zae-bayern.de
Thu Apr 11 15:57:41 CEST 2019


Hi Szilárd,

Many thanks, it is now also clear to me how the tests are verified.

This means I can now trust my energy calculations.

Thanks again,
Steffi


-----Original Message-----
From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of Szilárd Páll
Sent: Wednesday, 10 April 2019 23:44
To: Discussion list for GROMACS users
Subject: Re: [gmx-users] WG: WG: Issue with CUDA and gromacs

Hi,
On Wed, Apr 10, 2019 at 4:19 PM Tafelmeier, Stefanie <
Stefanie.Tafelmeier at zae-bayern.de> wrote:

> Dear Szilárd and Jon,
>
> many thanks for your support.
>
> The system was Ubuntu 18.04 LTS, gcc 7.3 and CUDA 9.2.
> We have now upgraded gcc (to 8.2) and CUDA (to 10.1).
>
> The regressiontests now all pass.
> The tests Szilárd asked for earlier all run as well. Even just using
> mdrun -nt 80 works.
>

Great, this confirms that there was indeed a strange compatibility issue as
Jon suggested.

> Many thanks! It seems that this was the origin of the problem.
>
> Just to be sure, I would like to have a look at the short-range values of
> the complex tests, since earlier some tests passed even without having
> the right values.
>

What do you mean by that?


> Is there a way to compare them, or a list with the correct outcomes?
>

When the regressiontests are executed, the output by default lists all
commands that do the test runs as well as those that verify the outputs,
e.g.

$ perl gmxtest.pl complex
[...]
Testing acetonitrilRF . . . gmx grompp -f ./grompp.mdp -c ./conf -r ./conf -p ./topol -maxwarn 10  >grompp.out 2>grompp.err
gmx check -s1 ./reference_s.tpr -s2 topol.tpr -tol 0.0001 -abstol 0.001 >checktpr.out 2>checktpr.err
 gmx mdrun    -nb cpu   -notunepme >mdrun.out 2>&1
gmx check -e ./reference_s.edr -e2 ener.edr -tol 0.001 -abstol 0.05 -lastener Potential >checkpot.out 2>checkpot.err
gmx check -f ./reference_s.trr -f2 traj.trr -tol 0.001 -abstol 0.05 >checkforce.out 2>checkforce.err
PASSED but check mdp file differences

The gmx check commands do the checking, and the reference_s|d files (for
single or double precision builds) are what the outputs are compared against.
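
To repeat one of these comparisons by hand, the check command printed
above can be rerun inside a test directory once the test has finished. The
following is only a sketch: the function name, the DRY_RUN switch and the
example directory are assumptions made for illustration, while the gmx check
options are copied from the output above.

```shell
# Re-run the potential-energy comparison for a single test case.
# recheck_potential and DRY_RUN are hypothetical helpers, not part of
# the regressiontests package.
recheck_potential() {
    (
        # $1: test directory, e.g. complex/acetonitrilRF (assumed layout)
        cd "${1:-.}" || exit 1
        # With DRY_RUN set, only print the command instead of running it.
        ${DRY_RUN:+echo} gmx check -e ./reference_s.edr -e2 ener.edr \
            -tol 0.001 -abstol 0.05 -lastener Potential
    )
}
```

Calling "recheck_potential complex/acetonitrilRF" from the regressiontests
directory would then print the comparison to the terminal rather than to
checkpot.out; with DRY_RUN=1 it only shows the command it would run.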

--
Szilárd


> Anyway, here is the link to the tar-ball of the complex folder in case
> there is interest:
> https://it-service.zae-bayern.de/Team/index.php/s/mMyt3MPEfRrn8Ge
>
> Many thanks again for your help.
>
> Best wishes,
> Steffi
>
>
>
>
> -----Original Message-----
> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
> gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of
> Jonathan Vincent
> Sent: Tuesday, 9 April 2019 22:13
> To: gmx-users at gromacs.org
> Subject: Re: [gmx-users] WG: WG: Issue with CUDA and gromacs
>
> Hi,
>
> Which operating system are you running on? We have seen some strange
> behavior with a large number of threads, gcc 7.3 and a newish version of
> glibc. Specifically with the default combination that comes with Ubuntu
> 18.04 LTS, but it might be more generic than that.
>
> My suggestion would be to update to gcc 8.3 and CUDA 10.1 (which is
> required for CUDA support of gcc 8), which seemed to fix the problem in
> that case.
>
> If you still have problems we can look at this some more.
>
> Jon
>
> -----Original Message-----
> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
> gromacs.org_gmx-users-bounces at maillist.sys.kth.se> On Behalf Of Szilárd
> Páll
> Sent: 09 April 2019 20:08
> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
> Subject: Re: [gmx-users] WG: WG: Issue with CUDA and gromacs
>
> Hi,
>
> One more test that I realized may be relevant, considering that we had a
> similar report earlier this year on similar CPU hardware:
> can you please compile with -DGMX_SIMD=AVX2_256 and rerun the tests?
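
A rebuild along these lines might look as follows. This is a minimal sketch
that assumes an already configured CMake build directory; the directory
name, the parallel job count and the DRY_RUN switch are placeholders, and
only -DGMX_SIMD=AVX2_256 comes from the message above.

```shell
# Reconfigure an existing GROMACS build directory with the requested SIMD
# setting, rebuild, and rerun the tests. rebuild_with_avx2 and DRY_RUN are
# hypothetical helpers.
rebuild_with_avx2() {
    (
        # $1: build directory (assumed to hold an existing CMake cache)
        cd "${1:-build}" || exit 1
        # With DRY_RUN set, the commands are printed instead of executed.
        ${DRY_RUN:+echo} cmake . -DGMX_SIMD=AVX2_256 &&
        ${DRY_RUN:+echo} make -j 8 &&
        ${DRY_RUN:+echo} make check
    )
}
```

Running "DRY_RUN=1 rebuild_with_avx2 build" first is a cheap way to see the
three commands before committing to a full rebuild.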
>
> --
> Szilárd
>
>
> On Tue, Apr 9, 2019 at 8:35 PM Szilárd Páll <pall.szilard at gmail.com>
> wrote:
>
> > Dear Stefanie,
> >
> > On Fri, Apr 5, 2019 at 11:48 AM Tafelmeier, Stefanie <
> > Stefanie.Tafelmeier at zae-bayern.de> wrote:
> >
> >> Hi Szilárd,
> >>
> >> thanks for your advice.
> >> I performed the tests.
> >> Both ran without errors.
> >>
> >
> > OK, that excludes simple and obvious issues.
> > Wild guess, but can you run those again, this time prefixing the
> > command with "taskset -c 22-32"? This makes the tests use cores 22-32,
> > just to check whether using a specific set of cores may somehow
> > trigger an error.
> >
> > What CUDA version did you use to compile the memtest tool -- was it
> > the same (CUDA 9.2) as the one used for building GROMACS?
> >
> >> Just to get it right, I have to ask in more detail, because the
> >> connection between the CPU/GPU and the distribution of the
> >> calculation is still a bit blurry to me:
> >>
> >> If the output of the regressiontests shows that a test crashes after
> >> 1-2 steps, does this mean there is an issue with the transfer between
> >> the CPU and GPU?
> >> As far as I understand, the short-range calculation is normally split
> >> into nonbonded -> GPU and bonded -> CPU?
> >>
> >
> > The -nb/-pme/-bonded flags control which tasks execute where (if not
> > specified, defaults control this). The output contains a report that
> > summarizes where the major force tasks are executed. For example, this
> > excerpt from one of your log files shows that the PP tasks (i.e.
> > particle tasks like short-range nonbonded) and the full PME tasks are
> > offloaded to the GPU with ID 0 (to check which GPU that is, look at
> > the "Hardware detection" section of the log):
> >
> > 1 GPU selected for this run.
> > Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
> >   PP:0,PME:0
> > PP tasks will do (non-perturbed) short-ranged interactions on the GPU
> > PME tasks will do all aspects on the GPU
> >
> > For more details, please see
> > http://manual.gromacs.org/documentation/2019.1/user-guide/mdrun-performance.html#running-mdrun-with-gpus
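
To pull this report out of a log without scrolling, one can grep for it. A
small sketch, assuming the default log file name md.log and a grep that
supports -A (GNU or BSD); the function name is made up here.

```shell
# Print the GPU task assignment that mdrun wrote to its log.
# show_gpu_mapping is a hypothetical helper; md.log is assumed to be the
# log file name (pass another path as $1 if mdrun was given a different one).
show_gpu_mapping() {
    # Show the "GPU selected" line plus the four lines that follow it,
    # which hold the PP/PME mapping and the task descriptions.
    grep -A 4 'GPU selected for this run' "${1:-md.log}"
}

# An explicit assignment could be requested with, for example:
#   gmx mdrun -nb gpu -pme gpu -bonded cpu
```

If the mapping line reads "PP:0,PME:0" as in the excerpt above, both the
short-range nonbonded work and PME run on the GPU with ID 0.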
> >
> > We have seen two types of errors so far:
> > - "Asynchronous H2D copy failed: invalid argument", which is still
> > mysterious to me and has shown up both in your repeated manual runs
> > and in the regressiontests; as this aborts the run
> > - Failing regressiontests with either invalid results or crashes
> > (below above abort): to be honest I do not know what causes these but
> > given that results
> >
> > The latter errors indicate incorrect results: in your last "complex"
> > tests tarball I saw some tests failing with LINCS errors (and
> > indicating NaN values) and a good fraction of tests failing with
> > GPU-side assertions, both of which suggest that things do go wrong on
> > the GPU.
> >
> >> And does this mean that the calculations I do may also have wrong
> >> energies? Can I trust my results?
> >>
> >
> > At this point I can unfortunately not recommend running production
> > simulations on this machine.
> >
> > I will try to continue exploring the possible errors, and I hope you
> > can help out with some tests:
> >
> > - Please run the complex regressiontests (using the RelWithAssert
> > binary) by setting the CUDA_LAUNCH_BLOCKING environment variable. This
> > may allow us to reason better about the source of the errors. Also you
> > can reconfigure with cmake -DGMX_OPENMP_MAX_THREADS=128 to avoid the
> > 88 OpenMP thread errors in tests that you encountered yourself.
> >
> > - Can you please recompile GROMACS with CUDA 10 and check whether
> > either of the two kinds of errors reproduces? (If it does and you can
> > upgrade the driver, I suggest upgrading to CUDA 10.1.)
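
The first request could be scripted roughly as below. This is a sketch
under assumptions: the function names, directory arguments and DRY_RUN
switch are invented for illustration, and CUDA_LAUNCH_BLOCKING=1 is the
usual way of setting that variable. The cmake option and the gmxtest.pl
call are taken from the text above.

```shell
# Run the complex regressiontests with synchronous CUDA kernel launches.
# run_complex_blocking, reconfigure_max_threads and DRY_RUN are
# hypothetical helpers.
run_complex_blocking() {
    (
        # $1: regressiontests directory (assumed to contain gmxtest.pl)
        cd "${1:-regressiontests}" || exit 1
        # Exported only inside this subshell, so the caller's
        # environment is left unchanged.
        export CUDA_LAUNCH_BLOCKING=1
        ${DRY_RUN:+echo} perl gmxtest.pl complex
    )
}

# Raise the compile-time OpenMP thread limit in an existing build directory.
reconfigure_max_threads() {
    (
        cd "${1:-build}" || exit 1
        ${DRY_RUN:+echo} cmake . -DGMX_OPENMP_MAX_THREADS=128
    )
}
```

After reconfiguring, GROMACS still has to be rebuilt before the thread
limit takes effect.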
> >
> >
> >
> >>
> >> Many thanks again for your support.
> >> Best wishes,
> >> Steffi
> >>
> >>
> > --
> > Szilárd
> >
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.