[gmx-users] Multi-node GPU runs crashing with a fork() warning

Mark Abraham mark.j.abraham at gmail.com
Thu May 22 18:12:13 CEST 2014


On Thu, May 22, 2014 at 5:31 PM, Thomas C. O'Connor <toconnor at jhu.edu> wrote:

> Hey,
>
> Yes, everything runs fine if I work on one node with one or more GPUs. The
> crash occurs, similar to the previous mailing list post:
>
> http://comments.gmane.org/gmane.science.biology.gromacs.user/63911
>
> It crashes when we attempt to work across multiple GPU-enabled nodes. This
> happens when our cluster of C2050s is used but not when our nodes with
> K20s are used. The CPU architectures on the two node types are also
> slightly different, but both are Intel hardware.
>

I would suspect a difference in CUDA version, CUDA driver version, or MPI
version is the origin here (see the top of the mdrun log file for some of
this information). Normal use of mdrun does not call system() or fork() or
create child processes, which is the basis for my suspicion that the
problem is not within GROMACS. If there's a way to run some other MPI+GPU
code on the machine, whether it works might be useful information. Running
mdrun within a debugger will let you observe whether the crash happens
before mdrun starts, or not.
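One way to do that, sketched here with the usual Open MPI one-xterm-per-rank
trick; it assumes X forwarding to the node running mpirun, and the input file
name topol.tpr is a placeholder for whatever your run actually uses:

```shell
# Start every MPI rank under gdb in its own xterm window.
# Assumptions: X forwarding works, and topol.tpr / nodes.txt match your setup.
mpirun -machinefile nodes.txt -npernode 6 \
    xterm -e gdb --args mdrun_mpi -s topol.tpr

# In each gdb window:
#   (gdb) run          # start the rank
#   (gdb) backtrace    # after a crash, see where it happened
```

If the backtrace shows the crash happening inside the MPI or CUDA libraries
before any GROMACS frames appear, that would support the suspicion above.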


> Hopefully there is a solution.
>
> Thanks for the tips on launch configurations. I think my simulations are
> scaling pretty well with the GPUs. I'm working with a dense city of many
> millions of atoms.
>

Good, you'd want to be doing something like that.

Mark
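
P.S. If you want to silence the warning itself while you investigate (this
only suppresses the message; it won't stop a crash caused by whatever is
actually forking), the warning text already names the knob. A sketch against
your existing launch line:

```shell
# Disable Open MPI's fork() warning via the MCA parameter named in the message.
mpirun --mca mpi_warn_on_fork 0 -machinefile nodes.txt -npernode 6 mdrun_mpi

# Equivalent, via Open MPI's environment-variable form of MCA parameters:
export OMPI_MCA_mpi_warn_on_fork=0
```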


>
> On Wed, May 21, 2014 at 8:52 PM, Szilárd Páll <pall.szilard at gmail.com>
> wrote:
>
> > Hi,
> >
> > Sounds like an MPI or MPI+CUDA issue. Does mdrun run if you use a
> > single GPU? How about two?
> >
> > Btw, unless you have some rather exotic setup, you won't be able to
> > get much improvement from using more than three, at most four GPUs per
> > node - you need CPU cores to match them (and a large system to feed
> > the GPUs and scale). Multi runs could work well, though.
> >
> > Cheers,
> >
> >
> > --
> > Szilárd
> >
> >
> > On Wed, May 21, 2014 at 6:29 PM, Thomas C. O'Connor <toconnor at jhu.edu>
> > wrote:
> > > Hey Folks,
> > >
> > > I'm attempting to run simulations on a multi-node gpu cluster and my
> > > simulations are crashing after emitting an Open MPI fork() warning:
> > >
> > >
> > > ------------------------------------------------------------------------
> > > An MPI process has executed an operation involving a call to the
> > > "fork()" system call to create a child process.  Open MPI is currently
> > > operating in a condition that could result in memory corruption or
> > > other system errors; your MPI job may hang, crash, or produce silent
> > > data corruption.  The use of fork() (or system() or other calls that
> > > create child processes) is strongly discouraged.
> > >
> > > The process that invoked fork was:
> > >
> > >   Local host:          lngpu019 (PID 11549)
> > >   MPI_COMM_WORLD rank: 18
> > >
> > > If you are *absolutely sure* that your application will successfully
> > > and correctly survive a call to fork(), you may disable this warning
> > > by setting the mpi_warn_on_fork MCA parameter to 0.
> > > ------------------------------------------------------------------------
> > >
> > >
> > > I saw a similar mailing-list post about this sort of issue from
> > > September 2013, but the thread had no resolution.
> > >
> > >
> > >    - Each node of our cluster has 12 Intel cores and 6 NVIDIA Tesla
> > >    C2050 GPUs.
> > >
> > >
> > >    - We call: mpirun -machinefile nodes.txt -npernode 6 mdrun_mpi
> > >
> > >
> > >    - I compiled GROMACS on one of the compute nodes with the C2050s.
> > >
> > > We also have a few nodes with newer NVIDIA K20 GPUs. When we compile
> > > GROMACS on these nodes, we can run the code across multiple nodes and
> > > GPUs without any errors.
> > >
> > > I don't know whether the fork() error is directly related to the crash,
> > > or whether there might be obscure, device-specific object files outside
> > > my build directory that I should delete. Any insight you folks could
> > > provide to help me solve this issue would be appreciated.
> > >
> > > Thanks,
> > >
> > >
> > > --
> > > Thomas O'Connor
> > > Graduate Research Assistant
> > > MCS IGERT Fellow
> > >
> > > Department of Physics & Astronomy
> > > The Johns Hopkins University
> > > 3701 San Martin Drive
> > > Baltimore, MD 21218
> > > toconnor at jhu.edu
> > > 410.516.8587
> > > --
> > > Gromacs Users mailing list
> > >
> > > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > posting!
> > >
> > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > >
> > > * For (un)subscribe requests visit
> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-request at gromacs.org.
> >
>
>
>
> --
> Thomas O'Connor
> Graduate Research Assistant
> MCS IGERT Fellow
>
> Department of Physics & Astronomy
> The Johns Hopkins University
> 3701 San Martin Drive
> Baltimore, MD 21218
> toconnor at jhu.edu
> 410.516.8587
>

