[gmx-users] GPU job often failed
Albert
mailmd2011 at gmail.com
Wed Feb 12 10:01:03 CET 2014
Hello:
I noticed that my GPU job fails from time to time. Here is the error
output:
-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6.5
Source code file:
/home/albert/install/source_code/gromacs-4.6.5/src/mdlib/nbnxn_cuda/nbnxn_cuda.cu,
line: 591
Fatal error:
cudaStreamSynchronize failed in cu_blockwait_nb: unspecified launch failure
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
"RTFM" (B. Hess)
Error on node 1, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 1 out of 4
gcq#261: "RTFM" (B. Hess)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 6097 on
node node3 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
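To check whether the same rank and host fail every time, the key fields of a crash report like the one above can be pulled out and compared across runs. The following is only a minimal sketch: the function name summarize_failure and the regular expressions are illustrative assumptions based on the output shown here, not part of GROMACS or Open MPI.

```python
import re


def summarize_failure(log_text):
    """Extract the key fields from an mdrun / Open MPI crash report.

    The patterns are assumptions modelled on the output quoted in this
    message; other versions may word their messages differently.
    """
    patterns = {
        "version": r"VERSION\s+(\S+)",
        "source_line": r"line:\s*(\d+)",
        # The message is printed on the line after "Fatal error:".
        "fatal_error": r"Fatal error:\s*(.+)",
        "rank": r"MPI_ABORT was invoked on rank (\d+)",
        "pid": r"with PID (\d+)",
        "host": r"node (\S+) exiting",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log_text)
        # Fields that are not found are reported as None.
        summary[key] = match.group(1).strip() if match else None
    return summary
```

Calling summarize_failure on the saved output of each failed run and comparing the "rank" and "host" entries would show whether the crashes are tied to one particular node.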
I restart the job with the -cpi and -append options, but it fails again
1-2 days later with the same messages.
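The manual restart described above could be automated along the following lines. This is a hedged sketch, not a fix for the underlying failure: the run name "md" (and therefore the checkpoint file md.cpt), the retry limit, and the helper names are assumptions made for illustration, and the rank count of 4 is taken from the "CPU 1 out of 4" line in the output.

```python
import subprocess


def build_restart_command(deffnm="md", nranks=4):
    """Build an mdrun_mpi command that continues from a checkpoint.

    "md" is a hypothetical run name; -cpi reads the checkpoint and
    -append keeps writing to the existing output files.
    """
    return [
        "mpirun", "-np", str(nranks), "mdrun_mpi",
        "-deffnm", deffnm,
        "-cpi", f"{deffnm}.cpt",
        "-append",
    ]


def run_with_restarts(command, max_attempts=5, runner=subprocess.call):
    """Run the command until it exits with status 0.

    Returns the number of the successful attempt, or None if every
    attempt failed. The runner argument exists so the loop can be
    tried out without launching a real simulation.
    """
    for attempt in range(1, max_attempts + 1):
        if runner(command) == 0:
            return attempt
    return None
```

For example, run_with_restarts(build_restart_command("md", 4)) would relaunch the job up to five times. This only works around the crash; it does not explain the unspecified launch failure.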
Thank you very much.
Best,
Albert