[gmx-users] GPU job often stopped

Justin Lemkul jalemkul at vt.edu
Sun Apr 28 17:45:21 CEST 2013



On 4/28/13 11:27 AM, Albert wrote:
> Dear:
>
>    I am running MD jobs on a workstation with 4 K20 GPUs, and the jobs fail
> from time to time with the following messages:
>
>
> [tesla:03432] *** Process received signal ***
> [tesla:03432] Signal: Segmentation fault (11)
> [tesla:03432] Signal code: Address not mapped (1)
> [tesla:03432] Failing at address: 0xfffffffe02de67e0
> [tesla:03432] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f4666da1cb0]
> [tesla:03432] [ 1] mdrun_mpi() [0x47dd61]
> [tesla:03432] [ 2] mdrun_mpi() [0x47d8ae]
> [tesla:03432] [ 3] /opt/intel/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93) [0x7f46667904f3]
> [tesla:03432] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3432 on node tesla exited on signal
> 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
> I can continue the jobs with the mdrun options "-append -cpi", but they still
> stop from time to time. What could be the problem?
>

Frequent failures suggest that the simulated system is unstable.  Check your
.log file or stderr for Gromacs diagnostic messages.
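
As a rough illustration of that check, the sketch below scans log lines for a
few strings that commonly accompany an unstable run.  The pattern list is an
assumption chosen for illustration, not an official or complete set of Gromacs
messages, and the function name scan_log is hypothetical.

```python
import re

# Illustrative patterns only (an assumption, not an exhaustive or official
# list of Gromacs diagnostics). Extend as needed for your own runs.
PATTERNS = [
    r"LINCS WARNING",
    r"can not be settled",
    r"Fatal error",
    r"Segmentation fault",
]
_MATCHER = re.compile("|".join(PATTERNS), re.IGNORECASE)


def scan_log(lines):
    """Return (line_number, text) for every line matching one of PATTERNS.

    `lines` can be any iterable of strings, e.g. an open file handle.
    Line numbers are 1-based.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if _MATCHER.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it on a real run, pass an open file handle for the mdrun log (for
example, open("md.log"), where the file name depends on your -deffnm or -g
settings) and print the returned pairs.  Lines reported shortly before the
crash are usually the most informative.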

-Justin

-- 
========================================

Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

========================================


