[gmx-users] GPU job often stopped

Szilárd Páll szilard.pall at cbr.su.se
Mon Apr 29 13:19:31 CEST 2013


Have you tried running on CPUs only to see whether the issue persists?
If the crashes also occur with the same binary on the same hardware
when running on CPUs only, I doubt it's a problem in the code.
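
A minimal sketch of how such a CPU-only comparison run could be set up,
assuming GROMACS 4.6, where mdrun accepts "-nb cpu" to keep the non-bonded
calculation off the GPUs. The helper name build_mdrun_cmd, the "mpirun -np"
launcher, and the file prefix "md" are assumptions for illustration; only the
mdrun_mpi binary name comes from the backtrace quoted below.

```python
def build_mdrun_cmd(deffnm, nranks, use_gpu):
    """Return an mdrun command line as a list of arguments.

    deffnm  -- default file name prefix passed to -deffnm (hypothetical here)
    nranks  -- number of MPI ranks passed to the launcher (assumed: mpirun)
    use_gpu -- False forces the non-bonded kernels onto the CPU via -nb cpu
    """
    nb_target = "gpu" if use_gpu else "cpu"
    return [
        "mpirun", "-np", str(nranks),
        "mdrun_mpi",
        "-deffnm", deffnm,
        "-nb", nb_target,
    ]


if __name__ == "__main__":
    # Print the CPU-only variant so it can be pasted into a job script.
    print(" ".join(build_mdrun_cmd("md", 4, use_gpu=False)))
```

Running both variants from the same checkpoint, with the same binary, keeps
the comparison limited to where the non-bonded work is executed.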

Do you have ECC on?
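
One way to check this, as a rough sketch: nvidia-smi can report the current
ECC mode per GPU through its --query-gpu interface (assuming a driver recent
enough to support that option). The function names parse_ecc and query_ecc
are hypothetical, and the sample device name used in the usage note is
illustrative only, not output from the workstation in question.

```python
import subprocess

# Query fields: GPU index, product name, current ECC mode.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,ecc.mode.current",
    "--format=csv,noheader",
]


def parse_ecc(csv_text):
    """Map GPU index -> (name, ECC mode) from nvidia-smi CSV output."""
    result = {}
    for line in csv_text.strip().splitlines():
        index, name, mode = [field.strip() for field in line.split(",")]
        result[int(index)] = (name, mode)
    return result


def query_ecc():
    """Run nvidia-smi and return the parsed ECC state of every GPU."""
    completed = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_ecc(completed.stdout)
```

Calling query_ecc() on the GPU node returns a dictionary keyed by GPU index;
parse_ecc("0, Tesla K20c, Enabled") would give {0: ("Tesla K20c", "Enabled")}.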
--
Szilárd


On Sun, Apr 28, 2013 at 5:27 PM, Albert <mailmd2011 at gmail.com> wrote:
> Dear:
>
>   I am running MD jobs on a workstation with 4 K20 GPUs, and the jobs
> fail from time to time with the following messages:
>
>
> [tesla:03432] *** Process received signal ***
> [tesla:03432] Signal: Segmentation fault (11)
> [tesla:03432] Signal code: Address not mapped (1)
> [tesla:03432] Failing at address: 0xfffffffe02de67e0
> [tesla:03432] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f4666da1cb0]
> [tesla:03432] [ 1] mdrun_mpi() [0x47dd61]
> [tesla:03432] [ 2] mdrun_mpi() [0x47d8ae]
> [tesla:03432] [ 3] /opt/intel/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93) [0x7f46667904f3]
> [tesla:03432] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3432 on node tesla exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
> I can continue the jobs with the mdrun options "-append -cpi", but they
> still stop from time to time. What could the problem be?
>
> Thank you very much,
> Albert