[gmx-users] WG: WG: Issue with CUDA and gromacs
Tafelmeier, Stefanie
Stefanie.Tafelmeier at zae-bayern.de
Fri Mar 29 16:53:35 CET 2019
Hi Szilárd,
thanks for your advice.
I performed the tests.
Both completed without errors.
Just to get it right, I have to ask in more detail, because the connection between the CPU/GPU and the distribution of the calculation is still a bit blurry to me:
If the output of the regressiontests shows that a test crashes after 1-2 steps, does this mean there is an issue with the transfer between the CPU and GPU?
As far as I understand, the short-range calculation is normally split into nonbonded -> GPU and bonded -> CPU?
And does this mean that the calculations I run may also have wrong energies? Can I trust my results?
Many thanks again for your support.
Best wishes,
Steffi
-----Original Message-----
From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of Szilárd Páll
Sent: Friday, 29 March 2019 01:24
To: Discussion list for GROMACS users
Subject: Re: [gmx-users] WG: WG: Issue with CUDA and gromacs
Hi,
The standard output of the first set of runs is also something I was
interested in, but I've found the equivalent in the
complex/TESTDIR/mdrun.out files. What I see in the regressiontests output is
that the forces/energies are simply not correct; some tests simply
crash after 1-2 steps, but others do complete (like nbnxn-free-energy/)
and their short-range energies are clearly far off.
I suggest checking whether there may be a hardware issue:
- run this memory testing tool:
git clone https://github.com/ComputationalRadiationPhysics/cuda_memtest.git
cd cuda_memtest
make cuda_memtest CFLAGS='-arch sm_30 -DSM_20 -O3 -DENABLE_NVML=0'
./cuda_memtest
- compile and run the gpu-burn tool:
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make
then run
./gpu_burn 300
to test for 5 minutes.
--
Szilárd
On Thu, Mar 28, 2019 at 3:46 PM Tafelmeier, Stefanie <
Stefanie.Tafelmeier at zae-bayern.de> wrote:
> Hi Szilárd,
>
> Thanks again!
>
> Regarding the test:
> -ntmpi 1 -ntomp 22 -pin on -pinstride 1: 2 out of 5 run
> https://it-service.zae-bayern.de/Team/index.php/s/XEQrYqq4pikGmMy /
> https://it-service.zae-bayern.de/Team/index.php/s/YBdKKJ9c7zQpEg9
> Including:
> -nsteps 0 -nb gpu -pme cpu -bonded cpu: 0 run
> https://it-service.zae-bayern.de/Team/index.php/s/YiByc7iXW5AW9ZX
> -nsteps 0 -nb gpu -pme gpu -bonded cpu: 2 out of 5 run
> https://it-service.zae-bayern.de/Team/index.php/s/JNPXQnEgYtTAxGj /
> https://it-service.zae-bayern.de/Team/index.php/s/6aq6BQwwbBELqWe
> -nsteps 0 -nb gpu -pme gpu -bonded gpu: 0 run
> https://it-service.zae-bayern.de/Team/index.php/s/yj4RAqPMFsDNgTc
>
> Including:
> -ntmpi 1 -ntomp 22 -pin on -pinstride 2: 1 out of 5 run
> https://it-service.zae-bayern.de/Team/index.php/s/q5jHbdJ2EygtDaQ /
> https://it-service.zae-bayern.de/Team/index.php/s/sRPccwHRxojW9J8
> -nsteps 0 -nb gpu -pme cpu -bonded cpu: 0 run
> https://it-service.zae-bayern.de/Team/index.php/s/GdKk5N68CY7BGxJ
> -nsteps 0 -nb gpu -pme gpu -bonded cpu: 1 out of 5 run
> https://it-service.zae-bayern.de/Team/index.php/s/orwzKJMampWwDo5 /
> https://it-service.zae-bayern.de/Team/index.php/s/JXApT4tFtxQWxG6
> -nsteps 0 -nb gpu -pme gpu -bonded gpu: 0 run
> https://it-service.zae-bayern.de/Team/index.php/s/8YKK7Zxax22RfGQ
>
> Including:
> -ntmpi 1 -ntomp 22 -pin on -pinstride 4: 1 out of 5 run
> https://it-service.zae-bayern.de/Team/index.php/s/szZjzaxmwfimrgB /
> https://it-service.zae-bayern.de/Team/index.php/s/QdTd2an9dbE9BSt
> -nsteps 0 -nb gpu -pme cpu -bonded cpu: 3 out of 5 run
> https://it-service.zae-bayern.de/Team/index.php/s/DPoqKrgcWfF5PKM /
> https://it-service.zae-bayern.de/Team/index.php/s/3NbsGHtCPsf7zFS
> -nsteps 0 -nb gpu -pme gpu -bonded cpu: 3 out of 5 run
> https://it-service.zae-bayern.de/Team/index.php/s/WqP4tXjrR8i3455 /
> https://it-service.zae-bayern.de/Team/index.php/s/DACGc86xxKR6pWs
> -nsteps 0 -nb gpu -pme gpu -bonded gpu: 0 run
> https://it-service.zae-bayern.de/Team/index.php/s/3nKdwA28KySLEdB
>
>
> Regarding the regressiontest:
> Here is the link to the tarball:
> https://it-service.zae-bayern.de/Team/index.php/s/mMyt3MPEfRrn8Ge
>
>
> Thanks again for all your support and fingers crossed!
>
> Best wishes,
> Steffi
>
>
>
>
>
> -----Original Message-----
> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
> gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of
> Szilárd Páll
> Sent: Wednesday, 27 March 2019 20:27
> To: Discussion list for GROMACS users
> Subject: Re: [gmx-users] WG: WG: Issue with CUDA and gromacs
>
> Hi Steffi,
>
> On Wed, Mar 27, 2019 at 1:08 PM Tafelmeier, Stefanie <
> Stefanie.Tafelmeier at zae-bayern.de> wrote:
>
> > Hi Szilárd,
> >
> > thanks again!
> > Here are the links to the log files of the runs that failed:
> > Old patch:
> > -ntmpi 1 -ntomp 22 -pin on -pinstride 1: none ran*
> > https://it-service.zae-bayern.de/Team/index.php/s/b4AYiMCoHeNgJH3
> > -ntmpi 1 -ntomp 22 -pin on -pinstride 2: none ran*
> > https://it-service.zae-bayern.de/Team/index.php/s/JEP2iwFFZCebZLF
> > -ntmpi 1 -ntomp 22 -pin on -pinstride 4: one out of 5 ran
> > https://it-service.zae-bayern.de/Team/index.php/s/apra2zS7FHdqDQy
> >
> > New patch:
> > -ntmpi 1 -ntomp 22 -pin on -pinstride 1: none ran*
> > https://it-service.zae-bayern.de/Team/index.php/s/jAD52jBgNddrS3w
> > -ntmpi 1 -ntomp 22 -pin on -pinstride 2: none ran*
> > https://it-service.zae-bayern.de/Team/index.php/s/bcRjtz7r9NekzKB
> > -ntmpi 1 -ntomp 22 -pin on -pinstride 4: none ran*
> > https://it-service.zae-bayern.de/Team/index.php/s/b3zp8DNztjE6ssF
> >
>
> This still doesn't tell much more unfortunately.
>
> Two more things to try (can be combined)
> - please first rebuild with
> cmake . -DCMAKE_BUILD_TYPE=RelWithAssert
> this may give us some extra debugging information during runs
> - please use this patch now -- it will print some additional information to the
> standard error output, so please capture that and share it:
> https://termbin.com/zq4q
> (you can redirect the output e.g. by gmx mdrun > mdrun.out 2>&1)
> - try running (with the above build + patch) the failing cases above
> a few times each:
> -nsteps 0 -nb gpu -pme cpu -bonded cpu
> -nsteps 0 -nb gpu -pme gpu -bonded cpu
> -nsteps 0 -nb gpu -pme gpu -bonded gpu
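
One way to script the repeated runs requested above is sketched here in Python. It reuses only mdrun options that appear in this thread; the default thread count and pin stride, the log file names, and the helper names build_command and run_case are assumptions made for illustration. It also assumes a run input file that mdrun finds by default in the working directory.

```python
import subprocess

# Offload combinations listed above (flags taken verbatim from the thread).
OFFLOAD_CASES = [
    ["-nb", "gpu", "-pme", "cpu", "-bonded", "cpu"],
    ["-nb", "gpu", "-pme", "gpu", "-bonded", "cpu"],
    ["-nb", "gpu", "-pme", "gpu", "-bonded", "gpu"],
]


def build_command(offload, ntomp=22, pinstride=1):
    """Return the mdrun argument list for one test case (hypothetical helper)."""
    return [
        "gmx", "mdrun",
        "-ntmpi", "1", "-ntomp", str(ntomp),
        "-pin", "on", "-pinstride", str(pinstride),
        "-nsteps", "0",
    ] + list(offload)


def run_case(cmd, repeats=5, log_prefix="mdrun", runner=subprocess.run):
    """Run cmd `repeats` times and return how many runs exited with status 0.

    stdout and stderr of each run are written to the same file (the
    equivalent of `> mdrun.out 2>&1`), so output printed to standard
    error is kept alongside the normal output.
    """
    successes = 0
    for i in range(repeats):
        with open(f"{log_prefix}_{i}.out", "w") as log:
            result = runner(cmd, stdout=log, stderr=subprocess.STDOUT)
        if result.returncode == 0:
            successes += 1
    return successes
```

For example, run_case(build_command(OFFLOAD_CASES[0]), log_prefix="case0") would run the first case five times and leave case0_0.out to case0_4.out behind for sharing.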
>
>
>
> > Regarding the Regressiontest:
> >
> > Sorry, I didn't get it the first time.
> > If the md.log files are enough, here is a folder with the failed parts of
> > the complex regression test:
> > https://it-service.zae-bayern.de/Team/index.php/s/64KAQBgNoPm4rJ2
> >
> > If you need any other files or the full directories please let me know.
> >
>
> Hmmm, looks like there are more issues here: some log files look truncated,
> others indicate termination by LINCS errors. Yes, the mdrun.out and
> checkpot* files would be useful. How about just making a tarball of the
> whole complex directory and sharing that?
>
>
>
>
> Hopefully these tests will shed some light on what the issue is.
>
> Cheers,
> --
> Szilard
>
> Again, many thanks for your support.
>
>
> > Best wishes,
> > Steffi
> >
> > -----Original Message-----
> > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
> > gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of
> > Szilárd Páll
> > Sent: Tuesday, 26 March 2019 16:57
> > To: Discussion list for GROMACS users
> > Subject: Re: [gmx-users] WG: WG: Issue with CUDA and gromacs
> >
> > Hi Steffi,
> >
> > Thanks for running the tests; yes, the patch file was meant to be applied
> > to the unchanged GROMACS 2019 code.
> >
> > Please also share the log files from the failed runs, not just the
> > copy-paste of the fatal error -- as a result of the additional check there
> > might have been a note printed, which is what I was after.
> >
> > Regarding the regression tests, what I would like to have is the actual
> > directories of the tests that failed, i.e. as your log indicates a few of
> > the complex tests at least.
> >
> > Cheers,
> > --
> > Szilárd
> >
> > On Tue, Mar 26, 2019 at 1:44 PM Tafelmeier, Stefanie <
> > Stefanie.Tafelmeier at zae-bayern.de> wrote:
> >
> > > Hi Szilárd,
> > >
> > > thanks again for your answer.
> > > Regarding the tests:
> > > without the new patch:
> > >
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 1: all ran
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 2: all ran
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 4: all ran
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 8: all ran
> > > and
> > > -ntmpi 1 -ntomp 22 -pin on -pinstride 1: none ran*
> > > -ntmpi 1 -ntomp 22 -pin on -pinstride 2: none ran*
> > > -ntmpi 1 -ntomp 22 -pin on -pinstride 4: one out of 5 ran
> > >
> > >
> > > With the new patch (devicebuffer.cuh had to be the original, right? The
> > > already patched file didn't work because the lines didn't match, as far as I
> > > understood.):
> > >
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 1: all ran
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 2: all ran
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 4: all ran
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 8: all ran
> > > and
> > > -ntmpi 1 -ntomp 22 -pin on -pinstride 1: none ran*
> > > -ntmpi 1 -ntomp 22 -pin on -pinstride 2: none ran*
> > > -ntmpi 1 -ntomp 22 -pin on -pinstride 4: none ran*
> > >
> > > * Fatal error:
> > > Asynchronous H2D copy failed: invalid argument
> > >
> > >
> > > Regarding the regressiontest:
> > > The LastTest.log is available here:
> > > https://it-service.zae-bayern.de/Team/index.php/s/3sdki7Cf2x2CEQi
> > > this was not given in the log:
> > > The following tests FAILED:
> > > 42 - regressiontests/complex (Timeout)
> > > 46 - regressiontests/essentialdynamics (Failed)
> > > Errors while running CTest
> > > CMakeFiles/run-ctest-nophys.dir/build.make:57: recipe for
> target
> > > 'CMakeFiles/run-ctest-nophys' failed
> > > make[3]: *** [CMakeFiles/run-ctest-nophys] Error 8
> > > CMakeFiles/Makefile2:1397: recipe for target
> > > 'CMakeFiles/run-ctest-nophys.dir/all'failed
> > > make[2]: *** [CMakeFiles/run-ctest-nophys.dir/all] Error 2
> > > CMakeFiles/Makefile2:1177: recipe for target
> > > 'CMakeFiles/check.dir/rule' failed
> > > make[1]: *** [CMakeFiles/check.dir/rule] Error 2
> > > Makefile:626: recipe for target 'check' failed
> > > make: *** [check] Error 2
> > >
> > > Many thanks again.
> > > Best wishes,
> > > Steffi
> > >
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
> > > gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of
> > > Szilárd Páll
> > > Sent: Monday, 25 March 2019 20:13
> > > To: Discussion list for GROMACS users
> > > Subject: Re: [gmx-users] WG: WG: Issue with CUDA and gromacs
> > >
> > > Hi,
> > >
> > >
> > >
> > > --
> > > Szilárd
> > >
> > >
> > > On Mon, Mar 18, 2019 at 2:34 PM Tafelmeier, Stefanie <
> > > Stefanie.Tafelmeier at zae-bayern.de> wrote:
> > >
> > > > Hi,
> > > >
> > > > Many thanks again.
> > > >
> > > > Regarding the tests:
> > > > - ntmpi 1 -ntomp 22 -pin on
> > > > >OK, so this suggests that your previously successful 22-thread runs did not
> > > > >turn on pinning, I assume?
> > > > It seems so, yet it does not run successfully each time. But with
> > > > 20 threads, which usually works without error, it does not look like
> > > > pinning is turned on.
> > > >
> > >
> > > Pinning is only turned on if mdrun can safely assume that the cores of the
> > > node are not shared by multiple applications. This assumption can only be
> > > made if all hardware threads of the entire node are used by the run itself
> > > (i.e. in your case 2x22 cores with HyperThreading, hence 2 threads each =
> > > 88 threads).
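
The rule described here can be restated as a small Python sketch. It is a simplification of what this message says, not GROMACS source code; the function names are made up for illustration, and the hardware numbers are those of the workstation discussed in this thread.

```python
def hardware_threads(sockets, cores_per_socket, threads_per_core):
    """Total number of hardware threads (logical cores) on the node."""
    return sockets * cores_per_socket * threads_per_core


def pins_by_default(threads_used, total_hw_threads):
    """Simplified rule from the message: without an explicit -pin option,
    pinning is only enabled when the run occupies every hardware thread."""
    return threads_used == total_hw_threads


total = hardware_threads(sockets=2, cores_per_socket=22, threads_per_core=2)
print(total)                       # 88
print(pins_by_default(22, total))  # False: -pin on has to be requested explicitly
print(pins_by_default(88, total))  # True
```

Under this reading, a 20- or 22-thread run on the 88-thread node would not be pinned unless -pin on is passed, which matches the observation quoted above.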
> > >
> > > -ntmpi 1 -ntomp 1 -pin on; runs
> > > > -ntmpi 1 -ntomp 2 -pin on; runs
> > > >
> > > > - ntmpi 24 -ntomp 1 -pinstride 1 -pin on; runs
> > > > - ntmpi 24 -ntomp 1 -pinstride 2 -pin on; runs
> > > >
> > > > After patch supplied:
> > > > - ntmpi 1 -ntomp 22 -pin on; sometime runs - sometimes doesn't* ->
> > > > md_run.log at :
> > > > https://it-service.zae-bayern.de/Team/index.php/s/ezXWnQ2pGNeFx6T
> > > >
> > > > md_norun.log at:
> > > > https://it-service.zae-bayern.de/Team/index.php/s/wYPY7dWEJdwmqJi
> > > > - ntmpi 1 -ntomp 22 -pin off; sometime runs - sometimes doesn't*
> (ran
> > > > before)
> > > > - ntmpi 1 -ntomp 23 -pin off; doesn't work* (ran before)
> > > >
> > > > - ntmpi 1 -ntomp 23 -pinstride 1 -pin on; doesn't work*
> > > >
> > > > - ntmpi 1 -ntomp 23 -pinstride 2 -pin on; doesn't work* (ran before)
> > > >
> > >
> > >
> > > The suspicious thing is that the patch I made only improves the verbosity
> > > of the error reporting; it should have no impact on whether the error is
> > > triggered or not. Considering the above behavior, it seems that pinning (at
> > > least the patterns tried) has no influence on whether the runs work.
> > >
> > > Can you please try:
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 1
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 2
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 4
> > > -ntmpi 1 -ntomp 11 -pin on -pinstride 8
> > > and
> > > -ntmpi 1 -ntomp 22 -pin on -pinstride 1
> > > -ntmpi 1 -ntomp 22 -pin on -pinstride 2
> > > -ntmpi 1 -ntomp 22 -pin on -pinstride 4
> > >
> > > And please run these 5 times each (-nsteps 0 is fine to make things
> > quick).
> > >
> > > Also, please use this patch
> > > https://termbin.com/r8kk
> > > Apply it the same way as you did the one before; it adds another check that might
> > > shed some light on what's going on.
> > >
> > > - ntmpi 24 -ntomp 1 -pinstride 1 -pin on; runs
> > > > - ntmpi 24 -ntomp 1 -pinstride 2 -pin on; runs
> > > >
> > > > * Fatal error:
> > > > Asynchronous H2D copy failed: invalid argument
> > > >
> > > > When compiling, make check shows that regressiontest-complex and
> > > > regressiontest-essentialdynamics fail.
> > > > I am not sure if this is related?
> > > >
> > >
> > > It might be, please share the outputs of the regressiontests.
> > >
> > > --
> > > Szilárd
> > >
> > >
> > > > Many thanks in advance.
> > > > Best wishes,
> > > > Steffi
> > > >
> > > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
> > > > gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of
> > > > Szilárd Páll
> > > > Sent: Friday, 15 March 2019 17:57
> > > > To: Discussion list for GROMACS users
> > > > Subject: Re: [gmx-users] WG: WG: Issue with CUDA and gromacs
> > > >
> > > > On Fri, Mar 15, 2019 at 5:02 PM Tafelmeier, Stefanie <
> > > > Stefanie.Tafelmeier at zae-bayern.de> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > about the tests:
> > > > > - ntmpi 1 -ntomp 22 -pin on; doesn't work*
> > > > >
> > > >
> > > > OK, so this suggests that your previously successful 22-thread runs did not
> > > > turn on pinning, I assume?
> > > > Can you please try:
> > > > -ntmpi 1 -ntomp 1 -pin on
> > > > -ntmpi 1 -ntomp 2 -pin on
> > > > to check whether pinning works at all?
> > > > Also, please try one/both of the above (assuming they fail) with the same
> > > > binary, but as a CPU-only run, i.e.
> > > > -ntmpi 1 -ntomp 1 -pin on -nb cpu
> > > >
> > > >
> > > > > - ntmpi 1 -ntomp 22 -pin off; runs
> > > > > - ntmpi 1 -ntomp 23 -pin off; runs
> > > > > - ntmpi 1 -ntomp 23 -pinstride 1 -pin on; doesn't work*
> > > > > - ntmpi 1 -ntomp 23 -pinstride 2 -pin on; runs
> > > > > - ntmpi 23 -ntomp 1 -pinstride 1 -pin on; doesn't work**
> > > > > - ntmpi 23 -ntomp 1 -pinstride 2 -pin on; doesn't work**
> > > > >
> > > >
> > > > Just to confirm, can you please run the **'s with -ntmpi 24 instead (to
> > > > avoid the DD error).
> > > >
> > > >
> > > > >
> > > > > *Error as known.
> > > > >
> > > > > **The number of ranks you selected (23) contains a large prime factor 23. In
> > > > > most cases this will lead to bad performance. Choose a number with smaller
> > > > > prime factors or set the decomposition (option -dd) manually.
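
The arithmetic behind this note can be checked with a few lines of Python. This is only a sketch of the factorisation; it does not reproduce the exact criterion mdrun applies when it prints the message, and the function name is hypothetical.

```python
def largest_prime_factor(n):
    """Return the largest prime factor of n (n >= 2) by trial division."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:  # whatever remains after the loop is itself prime
        largest = n
    return largest


for ranks in (22, 23, 24):
    print(ranks, largest_prime_factor(ranks))
```

23 is prime, so its largest prime factor is 23 itself, whereas 24 = 2 x 2 x 2 x 3 has only small factors; this is why 24 ranks is suggested in place of 23.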
> > > > >
> > > > > The log file is at:
> > > > > https://it-service.zae-bayern.de/Team/index.php/s/fypKB9iZJz8yXq8
> > > > >
> > > >
> > > > Will have a look and get back with more later.
> > > >
> > > >
> > > > >
> > > > > Many thanks again,
> > > > > Steffi
> > > > >
> > > > > -----Original Message-----
> > > > > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
> > > > > gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of
> > > > > Szilárd Páll
> > > > > Sent: Friday, 15 March 2019 16:27
> > > > > To: Discussion list for GROMACS users
> > > > > Subject: Re: [gmx-users] WG: WG: Issue with CUDA and gromacs
> > > > >
> > > > > Hi,
> > > > >
> > > > > Please share log files via an external service; attachments are not
> > > > > accepted on the list.
> > > > >
> > > > > Also, when checking the error with the patch supplied, please run the
> > > > > following cases -- no long runs are needed, I just want to know which of
> > > > > these run and which don't:
> > > > > - ntmpi 1 -ntomp 22 -pin on
> > > > > - ntmpi 1 -ntomp 22 -pin off
> > > > > - ntmpi 1 -ntomp 23 -pin off
> > > > > - ntmpi 1 -ntomp 23 -pinstride 1 -pin on
> > > > > - ntmpi 1 -ntomp 23 -pinstride 2 -pin on
> > > > > - ntmpi 23 -ntomp 1 -pinstride 1 -pin on
> > > > > - ntmpi 23 -ntomp 1 -pinstride 2 -pin on
> > > > >
> > > > > Thanks,
> > > > > --
> > > > > Szilárd
> > > > >
> > > > >
> > > > > On Fri, Mar 15, 2019 at 4:04 PM Tafelmeier, Stefanie <
> > > > > Stefanie.Tafelmeier at zae-bayern.de> wrote:
> > > > >
> > > > > > Hi Szilárd,
> > > > > >
> > > > > > thanks for the quick reply.
> > > > > > About the first suggestion, I'll try and give feedback soon.
> > > > > >
> > > > > > Regarding the second, I attached the log-file for the case of
> > > > > > mdrun -v -nt 25
> > > > > > Which ends in the known error message.
> > > > > >
> > > > > > Again, thanks a lot for your information and help.
> > > > > >
> > > > > > Best wishes,
> > > > > > Steffi
> > > > > >
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
> > > > > > gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of
> > > > > > Szilárd Páll
> > > > > > Sent: Friday, 15 March 2019 15:30
> > > > > > To: Discussion list for GROMACS users
> > > > > > Subject: Re: [gmx-users] WG: WG: Issue with CUDA and gromacs
> > > > > >
> > > > > > Hi Stefanie,
> > > > > >
> > > > > > Unless and until the error and performance-related concerns prove to be
> > > > > > related, let's keep those separate.
> > > > > >
> > > > > > I'd first focus on the former. To be honest, I've never encountered such an
> > > > > > issue where, if you use more than a certain number of threads, the run
> > > > > > aborts with that error. To investigate further, can you please apply the
> > > > > > following patch file, which will hopefully give more context to the error:
> > > > > > https://termbin.com/uhgp
> > > > > > (e.g. you can execute the following to accomplish that:
> > > > > > curl https://termbin.com/uhgp > devicebuffer.cuh.patch && patch -p0 < devicebuffer.cuh.patch)
> > > > > >
> > > > > > Regarding the performance-related questions, can you please share a full
> > > > > > log file of the runs so we can see the machine config, simulation
> > > > > > system/settings, etc. Without that it is hard to judge what's best for your
> > > > > > case. However, if you only have a single GPU (which seems to be the case
> > > > > > based on the log excerpts) alongside those two rather beefy CPUs, then you will
> > > > > > likely not get much benefit from using all cores, and it is normal that you
> > > > > > see little to no improvement from using cores of a second CPU socket.
> > > > > >
> > > > > > Cheers,
> > > > > > --
> > > > > > Szilárd
> > > > > >
> > > > > >
> > > > > > On Thu, Mar 14, 2019 at 12:47 PM Tafelmeier, Stefanie <
> > > > > > Stefanie.Tafelmeier at zae-bayern.de> wrote:
> > > > > >
> > > > > > > Dear all,
> > > > > > >
> > > > > > > I was not sure if the email before reached you, but again many thanks for
> > > > > > > your reply Szilárd.
> > > > > > >
> > > > > > > As written below, we are still facing a problem with the performance of
> > > > > > > our workstation.
> > > > > > > I wrote before because of the error message that keeps occurring for
> > > > > > > mdrun simulations:
> > > > > > >
> > > > > > > Assertion failed:
> > > > > > > Condition: stat == cudaSuccess
> > > > > > > Asynchronous H2D copy failed
> > > > > > >
> > > > > > > As I mentioned, all installed versions (GROMACS, CUDA, nvcc, gcc) are the
> > > > > > > newest ones now.
> > > > > > >
> > > > > > > If I run mdrun without further settings, it leads to this error
> > > > > > > message. If I run it and choose the number of threads directly, mdrun
> > > > > > > performs well, but only for -nt values between 1 and 22. Higher ones again
> > > > > > > lead to the error message mentioned before.
> > > > > > >
> > > > > > > In order to investigate in more detail, I tried different values for
> > > > > > > -nt, -ntmpi and -ntomp, also combined with -npme:
> > > > > > > - The best performance in terms of ns/day is with -nt 22
> > > > > > > or -ntomp 22 alone. But then only 22 threads are involved,
> > > > > > > which is fine if I run more than one mdrun simultaneously, as I can distribute
> > > > > > > the other 66 threads. The GPU usage is then around 65%.
> > > > > > > - A similarly good performance is reached with
> > > > > > > mdrun -ntmpi 4 -ntomp 18 -npme 1 -pme gpu -nb gpu. But then 44 threads are
> > > > > > > involved. The GPU usage is then around 50%.
> > > > > > >
> > > > > > > I read the information on
> > > > > > > http://manual.gromacs.org/documentation/5.1/user-guide/mdrun-performance.html
> > > > > > > which was very helpful, but some things are still not clear to me:
> > > > > > > I was wondering if there is any other way to enhance the performance? Or
> > > > > > > what is the reason that the -nt maximum is 22 threads? Could this be
> > > > > > > connected to the sockets (see details below) of our workstation?
> > > > > > > It is not clear to me how a number of threads (-nt) higher than 22 can lead to
> > > > > > > the error regarding the asynchronous H2D copy.
> > > > > > >
> > > > > > > Please excuse all these questions. I would appreciate it a lot if you
> > > > > > > had a hint for this problem as well.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Steffi
> > > > > > >
> > > > > > > -----
> > > > > > >
> > > > > > > The workstation details are:
> > > > > > > Running on 1 node with total 44 cores, 88 logical cores, 1 compatible GPU
> > > > > > > Hardware detected:
> > > > > > >
> > > > > > > CPU info:
> > > > > > > Vendor: Intel
> > > > > > > Brand: Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
> > > > > > > Family: 6 Model: 85 Stepping: 4
> > > > > > > Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl clfsh
> > > > > > > cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq
> > > > > > > pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt
> > > > > > > x2apic
> > > > > > >
> > > > > > > Number of AVX-512 FMA units: 2
> > > > > > > Hardware topology: Basic
> > > > > > > Sockets, cores, and logical processors:
> > > > > > > Socket 0: [ 0 44] [ 1 45] [ 2 46] [ 3 47] [ 4 48] [ 5 49]
> > > > > > > [ 6 50] [ 7 51] [ 8 52] [ 9 53] [ 10 54] [ 11 55]
> > > > > > > [ 12 56] [ 13 57] [ 14 58] [ 15 59] [ 16 60] [ 17 61]
> > > > > > > [ 18 62] [ 19 63] [ 20 64] [ 21 65]
> > > > > > > Socket 1: [ 22 66] [ 23 67] [ 24 68] [ 25 69] [ 26 70] [ 27 71]
> > > > > > > [ 28 72] [ 29 73] [ 30 74] [ 31 75] [ 32 76] [ 33 77]
> > > > > > > [ 34 78] [ 35 79] [ 36 80] [ 37 81] [ 38 82] [ 39 83]
> > > > > > > [ 40 84] [ 41 85] [ 42 86] [ 43 87]
> > > > > > > GPU info:
> > > > > > > Number of GPUs detected: 1
> > > > > > > #0: NVIDIA Quadro P6000, compute cap.: 6.1, ECC: no, stat: compatible
> > > > > > >
> > > > > > > -----
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
> > > > > > > gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of
> > > > > > > Szilárd Páll
> > > > > > > Sent: Thursday, 31 January 2019 17:15
> > > > > > > To: Discussion list for GROMACS users
> > > > > > > Subject: Re: [gmx-users] WG: Issue with CUDA and gromacs
> > > > > > >
> > > > > > > On Thu, Jan 31, 2019 at 2:14 PM Szilárd Páll <
> > > pall.szilard at gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Wed, Jan 30, 2019 at 5:15 PM Tafelmeier, Stefanie
> > > > > > > > <Stefanie.Tafelmeier at zae-bayern.de> wrote:
> > > > > > > > >
> > > > > > > > > Dear all,
> > > > > > > > >
> > > > > > > > > We are facing an issue with the CUDA toolkit.
> > > > > > > > > We tried several combinations of GROMACS versions and CUDA toolkits.
> > > > > > > > > No toolkit older than 9.2 could be tried, as there are no NVIDIA
> > > > > > > > > drivers available for a Quadro P6000.
> > > > > > > > > Gromacs
> > > > > > > >
> > > > > > > > Install the latest 410.xx drivers and it will work; the NVIDIA driver
> > > > > > > > download website (https://www.nvidia.com/Download/index.aspx)
> > > > > > > > recommends 410.93.
> > > > > > > >
> > > > > > > > Here's a CUDA 10-compatible driver running on a system with
> > > > > > > > a P6000: https://termbin.com/ofzo
> > > > > > >
> > > > > > > Sorry, I misread that as "CUDA >=9.2 was not possible".
> > > > > > >
> > > > > > > Note that the driver is backward compatible, so you can use a new
> > > > > > > driver with older CUDA versions.
> > > > > > >
> > > > > > > Also note that the oldest driver NVIDIA claims to have P6000 support
> > > > > > > is 390.59, which is, as far as I know, one generation older than the 396
> > > > > > > that the CUDA 9.2 toolkit came with. This is, however, not something I'd
> > > > > > > recommend pursuing; use a new driver from the official site with any
> > > > > > > CUDA version that GROMACS supports and it should be fine.
> > > > > > >
> > > > > > > >
> > > > > > > > > CUDA
> > > > > > > > >
> > > > > > > > > Error message
> > > > > > > > >
> > > > > > > > > 2019
> > > > > > > > >
> > > > > > > > > 10.0
> > > > > > > > >
> > > > > > > > > gmx mdrun:
> > > > > > > > > Assertion failed:
> > > > > > > > > Condition: stat == cudaSuccess
> > > > > > > > > Asynchronous H2D copy failed
> > > > > > > > >
> > > > > > > > > 2019
> > > > > > > > >
> > > > > > > > > 9.2
> > > > > > > > >
> > > > > > > > > gmx mdrun:
> > > > > > > > > Assertion failed:
> > > > > > > > > Condition: stat == cudaSuccess
> > > > > > > > > Asynchronous H2D copy failed
> > > > > > > > >
> > > > > > > > > 2018.5
> > > > > > > > >
> > > > > > > > > 9.2
> > > > > > > > >
> > > > > > > > > gmx mdrun: Fatal error:
> > > > > > > > > HtoD cudaMemcpyAsync failed: invalid argument
> > > > > > > >
> > > > > > > > Can we get some more details on these, please? complete log
> > files
> > > > > > > > would be a good start.
> > > > > > > >
> > > > > > > > > 5.1.5
> > > > > > > > >
> > > > > > > > > 9.2
> > > > > > > > >
> > > > > > > > > Installation make: nvcc fatal : Unsupported gpu
> > architecture
> > > > > > > 'compute_20'*
> > > > > > > > >
> > > > > > > > > 2016.2
> > > > > > > > >
> > > > > > > > > 9.2
> > > > > > > > >
> > > > > > > > > Installation make: nvcc fatal : Unsupported gpu
> > architecture
> > > > > > > 'compute_20'*
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *We also tried to set the target CUDA architectures as described in
> > > > > > > > > the installation guide (manual.gromacs.org/documentation/2019/install-guide/index.html).
> > > > > > > > > Unfortunately it didn't work.
> > > > > > > >
> > > > > > > > What does it mean that it didn't work? Can you share the
> > command
> > > > you
> > > > > > > > used and what exactly did not work?
> > > > > > > >
> > > > > > > > For the P6000 which is a "compute capability 6.1" device (for
> > > > anyone
> > > > > > > > who needs to look it up, go here:
> > > > > > > > https://developer.nvidia.com/cuda-gpus), you should set
> > > > > > > > cmake ../ -DGMX_CUDA_TARGET_SM="61"
> > > > > > > >
> > > > > > > > --
> > > > > > > > Szilárd
> > > > > > > >
> > > > > > > > > Performing simulations on CPU only always works, yet they are of course
> > > > > > > > > slower than they could be when additionally using the GPU.
> > > > > > > > > The issue #2761 (https://redmine.gromacs.org/issues/2762) seems
> > > > > > > > > similar to our problem.
> > > > > > > > > Even though this issue is still open, we wanted to ask if you can give
> > > > > > > > > us any information about how to solve this problem?
> > > > > > > > >
> > > > > > > > > Many thanks in advance.
> > > > > > > > > Best regards,
> > > > > > > > > Stefanie Tafelmeier
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Further details if necessary:
> > > > > > > > > The workstation:
> > > > > > > > > 2 x Xeon Gold 6152 @ 3.7 GHz (22 cores, 44 threads, AVX512)
> > > > > > > > > Nvidia Quadro P6000 with 3840 Cuda-Cores
> > > > > > > > >
> > > > > > > > > The simulation system:
> > > > > > > > > Long-chain alkanes (previously used with GROMACS 5.1.5 and CUDA 7.5,
> > > > > > > > > which worked perfectly)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ZAE Bayern
> > > > > > > > > Stefanie Tafelmeier
> > > > > > > > > Bereich Energiespeicherung/Division Energy Storage
> > > > > > > > > Thermische Energiespeicher/Thermal Energy Storage
> > > > > > > > > Walther-Meißner-Str. 6
> > > > > > > > > 85748 Garching
> > > > > > > > >
> > > > > > > > > Tel.: +49 89 329442-75
> > > > > > > > > Fax: +49 89 329442-12
> > > > > > > > > Stefanie.tafelmeier at zae-bayern.de<mailto:
> > > > > > > Stefanie.tafelmeier at zae-bayern.de>
> > > > > > > > > http://www.zae-bayern.de<http://www.zae-bayern.de/>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ZAE Bayern - Bayerisches Zentrum für Angewandte
> > > Energieforschung
> > > > e.
> > > > > > V.
> > > > > > > > > Vorstand/Board:
> > > > > > > > > Prof. Dr. Hartmut Spliethoff (Vorsitzender/Chairman),
> > > > > > > > > Prof. Dr. Vladimir Dyakonov
> > > > > > > > > Sitz/Registered Office: Würzburg
> > > > > > > > > Registergericht/Register Court: Amtsgericht Würzburg
> > > > > > > > > Registernummer/Register Number: VR 1386
> > > > > > > > >
> > > > > > > > > Any declarations of intent, such as quotations, orders,
> > > > > > > > > applications and contracts, are legally binding for ZAE Bayern
> > > > > > > > > only if expressed in a written and duly signed form. This e-mail
> > > > > > > > > is intended solely for use by the recipient(s) named above. Any
> > > > > > > > > unauthorised disclosure, use or dissemination, whether in whole
> > > > > > > > > or in part, is prohibited. If you have received this e-mail in
> > > > > > > > > error, please notify the sender immediately and delete this
> > > > > > > > > e-mail.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Gromacs Users mailing list
> > > > > > > > >
> > > > > > > > > * Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
> > > > > > > > >
> > > > > > > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > > > > > > >
> > > > > > > > > * For (un)subscribe requests visit https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.
> > > > > > > --
> > > > > > > Gromacs Users mailing list
> > > > > > >
> > > > > > > * Please search the archive at
> > > > > > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List
> > before
> > > > > > > posting!
> > > > > > >
> > > > > > > * Can't post? Read
> http://www.gromacs.org/Support/Mailing_Lists
> > > > > > >
> > > > > > > * For (un)subscribe requests visit
> > > > > > >
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> > > > or
> > > > > > > send a mail to gmx-users-request at gromacs.org.
> > > > > > > --
> > > > > > > Gromacs Users mailing list
> > > > > > >
> > > > > > > * Please search the archive at
> > > > > > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List
> > before
> > > > > > > posting!
> > > > > > >
> > > > > > > * Can't post? Read
> http://www.gromacs.org/Support/Mailing_Lists
> > > > > > >
> > > > > > > * For (un)subscribe requests visit
> > > > > > >
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> > > > or
> > > > > > > send a mail to gmx-users-request at gromacs.org.
> > > > > > --
> > > > > > Gromacs Users mailing list
> > > > > >
> > > > > > * Please search the archive at
> > > > > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List
> before
> > > > > > posting!
> > > > > >
> > > > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > > > >
> > > > > > * For (un)subscribe requests visit
> > > > > >
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> > > or
> > > > > > send a mail to gmx-users-request at gromacs.org.
> > > > > > --
> > > > > > Gromacs Users mailing list
> > > > > >
> > > > > > * Please search the archive at
> > > > > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List
> before
> > > > > > posting!
> > > > > >
> > > > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > > > >
> > > > > > * For (un)subscribe requests visit
> > > > > >
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> > > or
> > > > > > send a mail to gmx-users-request at gromacs.org.
> > > > > --
> > > > > Gromacs Users mailing list
> > > > >
> > > > > * Please search the archive at
> > > > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > > > > posting!
> > > > >
> > > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > > >
> > > > > * For (un)subscribe requests visit
> > > > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> > or
> > > > > send a mail to gmx-users-request at gromacs.org.
> > > > > --
> > > > > Gromacs Users mailing list
> > > > >
> > > > > * Please search the archive at
> > > > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > > > > posting!
> > > > >
> > > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > > >
> > > > > * For (un)subscribe requests visit
> > > > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> > or
> > > > > send a mail to gmx-users-request at gromacs.org.
> > > > --
> > > > Gromacs Users mailing list
> > > >
> > > > * Please search the archive at
> > > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > > > posting!
> > > >
> > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > >
> > > > * For (un)subscribe requests visit
> > > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> or
> > > > send a mail to gmx-users-request at gromacs.org.
> > > > --
> > > > Gromacs Users mailing list
> > > >
> > > > * Please search the archive at
> > > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > > > posting!
> > > >
> > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > >
> > > > * For (un)subscribe requests visit
> > > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> or
> > > > send a mail to gmx-users-request at gromacs.org.
> > > --
> > > Gromacs Users mailing list
> > >
> > > * Please search the archive at
> > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > > posting!
> > >
> > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > >
> > > * For (un)subscribe requests visit
> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > > send a mail to gmx-users-request at gromacs.org.
> > > --
> > > Gromacs Users mailing list
> > >
> > > * Please search the archive at
> > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > > posting!
> > >
> > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > >
> > > * For (un)subscribe requests visit
> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > > send a mail to gmx-users-request at gromacs.org.
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-request at gromacs.org.
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-request at gromacs.org.
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.
--
Gromacs Users mailing list
* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.