[gmx-users] Multi-node GPU runs crashing with a fork() warning

Thu May 22 17:31:06 CEST 2014

Hey,

Yes, everything runs fine if I work on one node with one or more GPUs. The
crash is similar to the one described in this previous mailing list post:

http://comments.gmane.org/gmane.science.biology.gromacs.user/63911

It crashes when we attempt to work across multiple GPU-enabled nodes. This
happens when our cluster of C2050s is used but not when our nodes with
K20s are used. The CPU architectures of the two node types are also
slightly different, but both are Intel hardware.

Hopefully there is a solution.

Thanks for the tips on launch configurations. I think my simulations are
scaling pretty well with the GPUs. I'm working with a dense city of many
millions of atoms.
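
For concreteness, the launch line quoted below could be written out as the
following dry-run sketch. It assumes one MPI rank per GPU (6 per node) with
the 12 cores split evenly between ranks, and it only prints the command
rather than running it. The -ntomp and -gpu_id values and the
--mca mpi_warn_on_fork 0 setting are assumptions, not settings confirmed in
this thread; the MCA setting only silences the warning named in the Open MPI
message and is not expected to fix the crash itself.

```shell
#!/bin/sh
# Dry-run sketch: builds and prints the launch line, does not execute it.
# Node layout taken from the quoted message: 12 cores and 6 GPUs per node.
CORES_PER_NODE=12
GPUS_PER_NODE=6

# Assumption: one MPI rank per GPU, cores divided evenly among the ranks.
RANKS_PER_NODE=$GPUS_PER_NODE
THREADS_PER_RANK=$((CORES_PER_NODE / RANKS_PER_NODE))

# Build the per-node GPU id string for mdrun's -gpu_id option (0..N-1).
GPU_IDS=""
i=0
while [ "$i" -lt "$GPUS_PER_NODE" ]; do
    GPU_IDS="$GPU_IDS$i"
    i=$((i + 1))
done

# Assumption: "--mca mpi_warn_on_fork 0" only suppresses the fork() warning.
LAUNCH="mpirun --mca mpi_warn_on_fork 0 -machinefile nodes.txt -npernode $RANKS_PER_NODE mdrun_mpi -ntomp $THREADS_PER_RANK -gpu_id $GPU_IDS"

echo "$LAUNCH"
```

If the crash persists with the warning silenced, that would suggest the
fork() message is a side effect rather than the cause.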

On Wed, May 21, 2014 at 8:52 PM, Szilárd Páll <pall.szilard at gmail.com> wrote:

> Hi,
>
> Sounds like an MPI or MPI+CUDA issue. Does mdrun run if you use a
> single GPU? How about two?
>
> Btw, unless you have some rather exotic setup, you won't be able to
> get much improvement from using more than three, at most four GPUs per
> node - you need CPU cores to match them (and a large system to feed
> the GPUs and scale). Multi runs could work well, though.
>
> Cheers,
>
>
> --
> Szilárd
>
>
> On Wed, May 21, 2014 at 6:29 PM, Thomas C. O'Connor <toconnor at jhu.edu>
> wrote:
> > Hey Folks,
> >
> > I'm attempting to run simulations on a multi-node GPU cluster and my
> > simulations are crashing after flagging an Open MPI fork() warning:
> >
> >
> > --------------------------------------------------------------------------
> > An MPI process has executed an operation involving a call to the
> > "fork()" system call to create a child process.  Open MPI is currently
> > operating in a condition that could result in memory corruption or
> > other system errors; your MPI job may hang, crash, or produce silent
> > data corruption.  The use of fork() (or system() or other calls that
> > create child processes) is strongly discouraged.
> >
> > The process that invoked fork was:
> >
> >   Local host:          lngpu019 (PID 11549)
> >   MPI_COMM_WORLD rank: 18
> >
> > If you are *absolutely sure* that your application will successfully
> > and correctly survive a call to fork(), you may disable this warning
> > by setting the mpi_warn_on_fork MCA parameter to 0.
> > --------------------------------------------------------------------------
> >
> >
> > I saw a similar mailing-list post about this sort of issue from September
> > 2013, but the thread had no resolution.
> >
> >
> >    - Each node of our cluster has 12 Intel cores and 6 NVIDIA Tesla
> >    C2050 GPUs.
> >
> >
> >    - We call: mpirun -machinefile nodes.txt -npernode 6 mdrun_mpi
> >
> >
> >    - I compiled GROMACS on one of the compute nodes with the C2050's.
> >
> > We also have a few nodes with newer NVIDIA K20 GPUs. When we compile
> > GROMACS on these nodes we can run the code across multiple nodes and
> > GPUs without any errors.
> >
> > I don't know whether the fork() warning is directly related to the crash,
> > or whether there might be obscure, device-specific object files outside my
> > build directory that I should delete. Any insight you folks could provide
> > to help me solve this issue would be appreciated.
> >
> > Thanks,
> >
> >
> > --
> > Thomas O'Connor
> > Graduate Research Assistant
> > MCS IGERT Fellow
> >
> > Department of Physics & Astronomy
> > The Johns Hopkins University
> > 3701 San Martin Drive
> > Baltimore, MD 21218
> > toconnor at jhu.edu
> > 410.516.8587
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-request at gromacs.org.

-- 
Thomas O'Connor
Graduate Research Assistant
MCS IGERT Fellow

Department of Physics & Astronomy
The Johns Hopkins University
3701 San Martin Drive
Baltimore, MD 21218
toconnor at jhu.edu
410.516.8587