[gmx-users] REMD stall out

Daniel Burns dburns at iastate.edu
Mon Feb 17 16:17:29 CET 2020


Thanks Mark and Szilard,

I forwarded Mark's suggestion to IT.  I'll see what they have to say, and
then I'll try the simulation again and open an issue on Redmine.

Thank you,

Dan

On Mon, Feb 17, 2020 at 9:09 AM Mark Abraham <mark.j.abraham at gmail.com>
wrote:

> Hi,
>
> That could be caused by configuration of the parallel file system or MPI on
> your cluster. If only one file descriptor is available per node to an MPI
> job, then your symptoms are explained. Some kinds of compute jobs follow
> such a model, so maybe someone optimized something for that.
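>
> A quick way to check (a sketch only, assuming a Slurm cluster; adjust to
> your launcher) is to print the per-process open-file limit as each node
> sees it from inside a job:
>
>     # one task per node; print each hostname and its open-file limit
>     srun --ntasks-per-node=1 bash -c 'hostname; ulimit -n'
>
> A very low limit there would be consistent with the symptoms you describe.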
>
> Mark
>
> On Mon, 17 Feb 2020 at 15:56, Daniel Burns <dburns at iastate.edu> wrote:
>
> > Hi Szilard,
> >
> > I've deleted all my output, but all writing to the log and console stops
> > around the step noting the domain decomposition (or another preliminary
> > task).  It is the same with or without Plumed; the TREMD run with plain
> > GROMACS was the first thing to present this issue.
> >
> > I've discovered that if each replica is assigned its own node, the
> > simulations proceed.  If I try to run several replicas on each node
> > (divided evenly), the simulations stall out before any trajectories get
> > written.
> >
> > I have tried many different -np and -ntomp options, as well as several
> > Slurm job submission scripts with various node/thread configurations, but
> > multiple simulations per node will not work.  I need to be able to run
> > several replicas on the same node to get enough data, since it's hard to
> > get more than 8 nodes (and, as a result, replicas).
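> >
> > For reference, the kind of layout I've been trying looks roughly like
> > this (a sketch only; the replica count, core counts, and directory names
> > are placeholders):
> >
> >     #SBATCH --nodes=2
> >     #SBATCH --ntasks-per-node=4      # 4 replicas packed per node
> >     #SBATCH --cpus-per-task=9
> >
> >     # one MPI rank per replica; rep0..rep7 hold the per-replica inputs
> >     mpirun -np 8 gmx_mpi mdrun -multidir rep{0..7} -replex 1000 -ntomp 9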
> >
> > Thanks for your reply.
> >
> > -Dan
> >
> > On Tue, Feb 11, 2020 at 12:56 PM Daniel Burns <dburns at iastate.edu>
> > wrote:
> >
> > > Hi,
> > >
> > > I continue to have trouble getting an REMD job to run.  It never makes
> > > it to the point of generating trajectory files, but it never gives an
> > > error either.
> > >
> > > I have switched from a large TREMD with 72 replicas to the Plumed
> > > Hamiltonian replica exchange method with only 6 replicas.  Everything
> > > is now on one node, and each replica has 6 cores.  I've turned off
> > > dynamic load balancing on this attempt, per the recommendation on the
> > > Plumed site.
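> > >
> > > The launch line is along these lines (assuming a PLUMED-patched
> > > GROMACS build, which provides mdrun's -plumed and -hrex options; the
> > > file and directory names are placeholders):
> > >
> > >     # 6 replicas, one MPI rank each, 6 OpenMP threads per rank
> > >     mpirun -np 6 gmx_mpi mdrun -multidir rep{0..5} -plumed plumed.dat \
> > >         -hrex -replex 200 -dlb no -ntomp 6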
> > >
> > > Any ideas on how to troubleshoot?
> > >
> > > Thank you,
> > >
> > > Dan
> > >