[gmx-users] Gromacs 2019.4 - cudaStreamSynchronize failed issue

Chenou Zhang czhan178 at asu.edu
Wed Dec 4 21:35:12 CET 2019


We did test that.
Our cluster has 11 GPU nodes in total, and I ran 20 tests spread across all of
them. 7 of the 20 tests showed the potential energy jump issue, and those
failures occurred on 5 different nodes.
So I tend to believe this issue can happen on any of the nodes.
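As a side note, a run that has blown up like the ones quoted below is easy to flag programmatically. Here is a minimal sketch that scans an md.log-style "Energies" block for an absurd Potential value; the threshold and the parsing details are assumptions based on the log excerpts in this thread, not a GROMACS tool:

```python
# Flag GROMACS log text whose reported Potential energy has blown up.
# Assumes the usual md.log "Energies (kJ/mol)" layout, where the line
# after the header row starting with "Potential" holds the values, and
# the first value on that line is the Potential itself.
import re

BLOWUP = 1.0e30  # anything near float-max (e.g. 1.35726e+35) is nonsense


def potential_blew_up(log_text: str, threshold: float = BLOWUP) -> bool:
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if line.lstrip().startswith("Potential") and i + 1 < len(lines):
            values = re.findall(r"[-+]?\d+\.\d+e[+-]\d+", lines[i + 1])
            if values and abs(float(values[0])) > threshold:
                return True
    return False


healthy = """      Potential    Kinetic En.
   -9.09123e+05    2.80635e+05"""
broken = """      Potential    Kinetic En.
    1.35726e+35    2.77598e+05"""

print(potential_blew_up(healthy))  # False
print(potential_blew_up(broken))   # True
```

Running something like this over all 20 logs is how one can quickly see which nodes produced the bad runs.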

On Wed, Dec 4, 2019 at 1:14 PM Szilárd Páll <pall.szilard at gmail.com> wrote:

> The fact that you are observing errors along with energies that are off by so
> much, and that it reproduces with multiple inputs, suggests that this may not
> be a code issue. Did you do all the runs that failed on the same hardware?
> Have you excluded the possibility that one of those GeForce cards may be flaky?
>
> --
> Szilárd
>
>
> On Wed, Dec 4, 2019 at 7:47 PM Chenou Zhang <czhan178 at asu.edu> wrote:
>
> > We tried the same gmx settings in 2019.4 with different protein systems,
> > and we got the same weird potential energy jump within 1000 steps.
> >
> > ```
> >            Step           Time
> >               0        0.00000
> >
> > Energies (kJ/mol)
> >            Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
> >     2.08204e+04    9.92358e+04    6.53063e+04    1.06706e+03   -2.75672e+02
> >           LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
> >     1.50031e+04   -4.86857e+04    3.10386e+04   -1.09745e+06    4.81832e+03
> >       Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
> >    -9.09123e+05    2.80635e+05   -6.28487e+05   -6.28428e+05    3.04667e+02
> >  Pressure (bar)   Constr. rmsd
> >    -1.56013e+00    3.60634e-06
> >
> > DD  step 999 load imb.: force 14.6%  pme mesh/force 0.581
> >            Step           Time
> >            1000        2.00000
> >
> > Energies (kJ/mol)
> >            Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
> >     2.04425e+04    9.92768e+04    6.52873e+04    1.02016e+03   -2.45851e+02
> >           LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
> >     1.49863e+04   -4.91092e+04    3.10572e+04   -1.09508e+06    4.97942e+03
> >       Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
> >     1.35726e+35    2.77598e+05    1.35726e+35    1.35726e+35    3.01370e+02
> >  Pressure (bar)   Constr. rmsd
> >    -7.55250e+01    3.63239e-06
> >
> > DD  step 1999 load imb.: force 16.1%  pme mesh/force 0.598
> >            Step           Time
> >            2000        4.00000
> >
> > Energies (kJ/mol)
> >            Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
> >     1.99521e+04    9.97482e+04    6.49595e+04    1.00798e+03   -2.42567e+02
> >           LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
> >     1.50156e+04   -4.85324e+04    3.01944e+04   -1.09620e+06    4.82958e+03
> >       Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
> >     1.35726e+35    2.79206e+05    1.35726e+35    1.35726e+35    3.03115e+02
> >  Pressure (bar)   Constr. rmsd
> >    -5.50508e+01    3.64353e-06
> >
> > DD  step 2999 load imb.: force 16.6%  pme mesh/force 0.602
> >            Step           Time
> >            3000        6.00000
> >
> > Energies (kJ/mol)
> >            Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
> >     1.98590e+04    9.88100e+04    6.50934e+04    1.07048e+03   -2.38831e+02
> >           LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
> >     1.49609e+04   -4.93079e+04    3.12273e+04   -1.09582e+06    4.83209e+03
> >       Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
> >     1.35726e+35    2.79438e+05    1.35726e+35    1.35726e+35    3.03367e+02
> >  Pressure (bar)   Constr. rmsd
> >     7.62438e+01    3.61574e-06
> > ```
> >
> > On Mon, Dec 2, 2019 at 2:13 PM Mark Abraham <mark.j.abraham at gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > What driver version is reported in the respective log files? Does the error
> > > persist if mdrun -notunepme is used?
> > >
> > > Mark
> > >
> > > On Mon., 2 Dec. 2019, 21:18 Chenou Zhang, <czhan178 at asu.edu> wrote:
> > >
> > > > Hi Gromacs developers,
> > > >
> > > > I'm currently running GROMACS 2019.4 on our university's HPC cluster. To
> > > > fully utilize the GPU nodes, I followed the notes at
> > > > http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html
> > > >
> > > >
> > > > And here is the command I used for my runs.
> > > > ```
> > > > gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu \
> > > >     -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 \
> > > >     -maxh $HOURS -cpt 60 -cpi -noappend
> > > > ```
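For context, the -gputasks string assigns one GPU id per GPU task in rank order; with -ntmpi 8 and -npme 1 that is 7 PP tasks plus 1 PME task, so 00112233 puts two tasks on each of four GPUs. A small sketch of how such a string could be generated (this helper is purely illustrative, not part of GROMACS):

```python
def gputasks_string(ntasks: int, ngpus: int) -> str:
    """Spread ntasks GPU tasks over ngpus devices in contiguous blocks.

    Mirrors the layout of a string like '00112233' (8 tasks over 4 GPUs).
    Illustrative helper only; GROMACS itself takes the string via -gputasks.
    """
    per_gpu, extra = divmod(ntasks, ngpus)
    parts = []
    for gpu in range(ngpus):
        # earlier GPUs absorb the remainder when tasks don't divide evenly
        parts.append(str(gpu) * (per_gpu + (1 if gpu < extra else 0)))
    return "".join(parts)


print(gputasks_string(8, 4))  # 00112233
```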
> > > >
> > > > And some of those runs fail with the following error:
> > > > ```
> > > > -------------------------------------------------------
> > > >
> > > > Program:     gmx mdrun, version 2019.4
> > > >
> > > > Source file: src/gromacs/gpu_utils/cudautils.cuh (line 229)
> > > >
> > > > MPI rank:    3 (out of 8)
> > > >
> > > >
> > > >
> > > > Fatal error:
> > > >
> > > > cudaStreamSynchronize failed: an illegal memory access was encountered
> > > >
> > > > For more information and tips for troubleshooting, please check the
> > > > GROMACS website at http://www.gromacs.org/Documentation/Errors
> > > > ```
> > > >
> > > > We also got a different error from the Slurm system:
> > > > ```
> > > > step 4400: timed with pme grid 96 96 60, coulomb cutoff 1.446: 467.9 M-cycles
> > > > step 4600: timed with pme grid 96 96 64, coulomb cutoff 1.372: 451.4 M-cycles
> > > > /var/spool/slurmd/job2321134/slurm_script: line 44: 29866 Segmentation fault      gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS -cpt 60 -cpi -noappend
> > > > ```
> > > >
> > > > We first thought this could be due to a compiler issue, so we tried the
> > > > following toolchain combinations:
> > > > ===test1===
> > > > <source>
> > > > module load cuda/9.2.88.1
> > > > module load gcc/7.3.0
> > > > . /home/rsexton2/Library/gromacs/2019.4/test1/bin/GMXRC
> > > > </source>
> > > > ===test2===
> > > > <source>
> > > > module load cuda/9.2.88.1
> > > > module load gcc/6x
> > > > . /home/rsexton2/Library/gromacs/2019.4/test2/bin/GMXRC
> > > > </source>
> > > > ===test3===
> > > > <source>
> > > > module load cuda/9.2.148
> > > > module load gcc/7.3.0
> > > > . /home/rsexton2/Library/gromacs/2019.4/test3/bin/GMXRC
> > > > </source>
> > > > ===test4===
> > > > <source>
> > > > module load cuda/9.2.148
> > > > module load gcc/6x
> > > > . /home/rsexton2/Library/gromacs/2019.4/test4/bin/GMXRC
> > > > </source>
> > > > ===test5===
> > > > <source>
> > > > module load cuda/9.1.85
> > > > module load gcc/6x
> > > > . /home/rsexton2/Library/gromacs/2019.4/test5/bin/GMXRC
> > > > </source>
> > > > ===test6===
> > > > <source>
> > > > module load cuda/9.0.176
> > > > module load gcc/6x
> > > > . /home/rsexton2/Library/gromacs/2019.4/test6/bin/GMXRC
> > > > </source>
> > > > ===test7===
> > > > <source>
> > > > module load cuda/9.2.88.1
> > > > module load gccgpu/7.4.0
> > > > . /home/rsexton2/Library/gromacs/2019.4/test7/bin/GMXRC
> > > > </source>
> > > >
> > > > However, we still ended up with the same errors shown above. Does anyone
> > > > know where the cudaStreamSynchronize call comes from? Or am I using the
> > > > gmx GPU options incorrectly?
> > > >
> > > > Any input will be appreciated!
> > > >
> > > > Thanks!
> > > > Chenou
> > > > --
> > > > Gromacs Users mailing list
> > > >
> > > > * Please search the archive at
> > > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > > > posting!
> > > >
> > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > >
> > > > * For (un)subscribe requests visit
> > > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> or
> > > > send a mail to gmx-users-request at gromacs.org.
> > > >


More information about the gromacs.org_gmx-users mailing list