[gmx-users] GPU job often failed

Szilárd Páll pall.szilard at gmail.com
Wed Feb 12 13:54:23 CET 2014


Your mail reads like an FYI; what is the question?

In case you were wondering what causes this: it could simply be a
soft error, but it's hard to tell. What GPU are you running on? If
it's in your own workstation, consider running a longer
stress test on it using e.g. CUDA memtest.
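
For context, "unspecified launch failure" is the CUDA runtime's generic
report that an earlier kernel launch on the stream crashed (commonly a bad
device-memory access, sometimes flaky hardware); mdrun only sees it at the
next synchronization point, which is why it surfaces in cu_blockwait_nb.
The snippet below is only an illustrative sketch of that standard CUDA
error-checking pattern, not the actual GROMACS code:

// Illustrative only: an asynchronous kernel failure is reported at
// cudaStreamSynchronize, the same call that fails in cu_blockwait_nb.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that deliberately writes through a bad pointer.
__global__ void touch(float *p) { p[threadIdx.x] = 1.0f; }

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);

    float *d = nullptr;            // deliberately invalid device pointer
    touch<<<1, 32, 0, s>>>(d);     // the launch call itself returns at once

    // The failure only becomes visible when the stream is synchronized.
    cudaError_t err = cudaStreamSynchronize(s);
    if (err != cudaSuccess) {
        // e.g. "unspecified launch failure" or, on newer drivers,
        // "an illegal memory access was encountered"
        fprintf(stderr, "cudaStreamSynchronize failed: %s\n",
                cudaGetErrorString(err));
    }

    cudaStreamDestroy(s);
    return err == cudaSuccess ? 0 : 1;
}

Compiling and running something like this (nvcc demo.cu) shows that the
error is only reported at synchronization time; a genuine soft error on the
GPU surfaces the same way, which is why a memory stress test is the usual
next diagnostic step.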

--
Szilárd


On Wed, Feb 12, 2014 at 10:00 AM, Albert <mailmd2011 at gmail.com> wrote:
> Hello:
>
> I have noticed that my GPU jobs fail from time to time; here is the
> information:
>
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.6.5
> Source code file:
> /home/albert/install/source_code/gromacs-4.6.5/src/mdlib/nbnxn_cuda/nbnxn_cuda.cu,
> line: 591
>
> Fatal error:
> cudaStreamSynchronize failed in cu_blockwait_nb: unspecified launch failure
>
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> "RTFM" (B. Hess)
>
> Error on node 1, will try to stop all the nodes
> Halting parallel program mdrun_mpi on CPU 1 out of 4
>
> gcq#261: "RTFM" (B. Hess)
>
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> with errorcode -1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 6097 on
> node node3 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
>
>
> I use the -cpi -append options to restart the job, but it still fails 1-2
> days later with the same messages.
>
> thank you very much
>
> best
> Albert

