[gmx-users] GPU job often failed

Albert mailmd2011 at gmail.com
Wed Feb 12 10:01:03 CET 2014


Hello:

I have noticed that my GPU job fails from time to time. Here is the
error output:


-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6.5
Source code file: /home/albert/install/source_code/gromacs-4.6.5/src/mdlib/nbnxn_cuda/nbnxn_cuda.cu, line: 591

Fatal error:
cudaStreamSynchronize failed in cu_blockwait_nb: unspecified launch failure

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

"RTFM" (B. Hess)

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 1 out of 4

gcq#261: "RTFM" (B. Hess)

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 6097 on
node node3 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------



I restart the job with the -cpi and -append options, but it fails again
1-2 days later with the same messages.
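
For reference, here is a minimal sketch of the kind of restart meant above. The run prefix "md" and the retry limit are hypothetical placeholders, not values from my actual setup; the rank count of 4 matches the log. The retry loop only automates restarting from the last checkpoint and does not address the underlying CUDA failure.

```shell
#!/bin/sh
# Sketch only: "md" (the -deffnm prefix) and the retry limit are
# hypothetical placeholders; 4 ranks matches the log above.

# Print the mdrun restart command for a run prefix and an MPI rank count.
# -cpi names the checkpoint to continue from; -append keeps writing to
# the existing output files.
restart_cmd() {
    prefix=$1
    nranks=$2
    echo "mpirun -np $nranks mdrun_mpi -deffnm $prefix -cpi $prefix.cpt -append"
}

# Re-run a command until it succeeds or max_tries attempts are used up.
retry() {
    max_tries=$1
    shift
    try=1
    while [ "$try" -le "$max_tries" ]; do
        "$@" && return 0
        echo "attempt $try failed, restarting from the last checkpoint" >&2
        try=$((try + 1))
    done
    return 1
}

# Example use (commented out so that sourcing this file runs nothing):
# retry 5 sh -c "$(restart_cmd md 4)"
```

Because every attempt passes -cpi, each restart continues from the most recent checkpoint rather than from the beginning.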

Thank you very much.

Best,
Albert

