[gmx-users] problem to restart REMD

Mark Abraham mark.j.abraham at gmail.com
Sun Jun 28 15:00:23 CEST 2015


Hi,

See mdrun -h regarding -cpt

Mark
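Some context on Mark's pointer: mdrun's -cpt option sets the checkpoint interval in minutes of wall-clock time (15 by default). How much simulated time passes between state*.cpt updates therefore depends on throughput. The sketch below does that arithmetic. The 24 ns/day figure is a hypothetical example, not a number from this thread.

```python
# Sketch: simulated time that passes between two periodic checkpoint writes.
# Assumption: mdrun writes state*.cpt roughly every `cpt_minutes` of wall-clock
# time (the -cpt option, 15 minutes by default). The throughput used below is
# a hypothetical example, not a measured value.

MINUTES_PER_DAY = 24 * 60


def ns_between_checkpoints(ns_per_day: float, cpt_minutes: float = 15.0) -> float:
    """Simulated nanoseconds advanced during one checkpoint interval."""
    if cpt_minutes <= 0:
        raise ValueError("this sketch assumes a positive checkpoint interval")
    return ns_per_day * cpt_minutes / MINUTES_PER_DAY


def expected_checkpoints(simulated_ns: float, ns_per_day: float,
                         cpt_minutes: float = 15.0) -> int:
    """Approximate number of periodic checkpoints written over `simulated_ns`."""
    return int(simulated_ns // ns_between_checkpoints(ns_per_day, cpt_minutes))


if __name__ == "__main__":
    # Hypothetical throughput of 24 ns/day per replica.
    print(ns_between_checkpoints(24.0), "ns between checkpoints")
    print(expected_checkpoints(1.5, 24.0), "checkpoints expected over 1.5 ns")
```

If no new state*.cpt files appear over 1.5 ns, check which -cpt value the job script actually passes.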

On Sun, Jun 28, 2015 at 1:53 PM leila salimi <leilasalimi at gmail.com> wrote:

> Dear Mark,
>
> I have a question about the updated state files. I am now running the
> simulation with 6 replicas, and the state*.cpt files have not been updated
> after 1.5 ns, which seems strange to me.
> How often are the restart files updated?
>
> Regards,
> Leila
>
> On Fri, Jun 26, 2015 at 9:49 PM, leila salimi <leilasalimi at gmail.com>
> wrote:
>
> > Dear Micholas,
> > I agree with you! I am trying to find out what is wrong with restarting
> > this system!
> > I am sure that if I start from the beginning, it will stop at this step and
> > get stuck!
> >
> > I checked everything and it seems fine, but REMD is not working!
> > Now I am trying to run only the first 5 replicas to see whether it gets
> > past this step or not!
> >
> > I will let you know what I find.
> >
> > Leila
> >
> > On Fri, Jun 26, 2015 at 9:16 PM, Smith, Micholas D. <smithmd at ornl.gov>
> > wrote:
> >
> >> Leila, your error is interesting, as I have had a very similar
> >> MPI_Allreduce error when I try to restart a large-scale REMD. The first few
> >> times the system restarted just fine, but at some point it fails.
> >>
> >> Out of curiosity, does it work if we re-run from the beginning?
> >>
> >> -Micholas
> >>
> >>
> >> ===================
> >> Micholas Dean Smith, PhD.
> >> Post-doctoral Research Associate
> >> University of Tennessee/Oak Ridge National Laboratory
> >> Center for Molecular Biophysics
> >>
> >> ________________________________________
> >> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
> >> gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of leila
> >> salimi <leilasalimi at gmail.com>
> >> Sent: Friday, June 26, 2015 1:30 PM
> >> To: gmx-users at gromacs.org
> >> Subject: Re: [gmx-users] problem to restart REMD
> >>
> >> I checked the steps in all the state.cpt files several times, and they
> >> are the same.
> >> When I try to restart, it runs for only a few steps (it took only 3
> >> minutes) and then stops with these lines in the error file:
> >>
> >> Abort(1) on node 12 (rank 12 in comm 1140850688): Fatal error in MPI_Allreduce: Other MPI error, error stack:
> >> MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff8606aa00, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000001) failed
> >> MPIR_Allreduce_impl(769).:
> >> MPIR_Allreduce_intra(270):
> >> MPIR_Bcast_impl(1462)....:
> >> MPIR_Bcast(1486).........:
> >> MPIR_Bcast_intra(1295)...:
> >> MPIR_Bcast_binomial(252).: message sizes do not match across processes in the collective routine: Received 64 but expected 32
> >> ERROR: 0031-300  Forcing all remote tasks to exit due to exit code 1 in task 12
> >>
> >> I guess the problem is related to MPI, but I don't understand why,
> >> because my other simulation is running well.
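A minimal sketch of what the byte counts in this error mean, assuming MPI_DOUBLE has the size of a C double (normally 8 bytes). The local call used count=4, which is 32 bytes. The 64 bytes that arrived correspond to 8 doubles. One plausible reading, not a confirmed diagnosis, is that the ranks were not executing the same collective with the same arguments. That would fit replicas being restarted from checkpoints that are out of step with each other.

```python
# Sketch: translate the byte counts in the MPI error into numbers of doubles.
# Assumption: MPI_DOUBLE matches a C double, whose size is read via struct
# (normally 8 bytes) rather than hard-coded.
import struct

DOUBLE_BYTES = struct.calcsize("d")


def doubles_in(nbytes: int) -> float:
    """How many doubles a message of `nbytes` bytes carries."""
    return nbytes / DOUBLE_BYTES


expected_bytes = 4 * DOUBLE_BYTES  # count=4, MPI_DOUBLE in the failing call
received_bytes = 64                # "Received 64 but expected 32"

print("expected:", expected_bytes, "bytes =", doubles_in(expected_bytes), "doubles")
print("received:", received_bytes, "bytes =", doubles_in(received_bytes), "doubles")
```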
> >>
> >> Thanks for your suggestion.
> >> Leila
> >>
> >> On Fri, Jun 26, 2015 at 7:10 PM, Mark Abraham <mark.j.abraham at gmail.com>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > I can't tell what you've done so that md0.log doesn't match, but that's
> >> > why I suggested you make a backup. You also don't have to use appending;
> >> > that's just for convenience. The advice about the node count mismatch
> >> > doesn't matter here... Use your judgement!
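Mark's point that appending is only a convenience can be sketched as a restart command built without it. The sketch assumes a -multi REMD setup with hypothetical values: 6 replicas, 32 MPI ranks each, an exchange attempt every 1000 steps, and base names topol.tpr and state.cpt. With -noappend, mdrun writes new output files with a part number in their names instead of appending to md0.log and the other existing outputs.

```python
# Sketch: build (not run) an mdrun_mpi restart command for a -multi REMD job.
# Hypothetical values: 6 replicas, 32 MPI ranks per replica, an exchange
# attempt every 1000 steps, base names topol.tpr and state.cpt. With -multi,
# mdrun adds the replica index to these names (topol0.tpr, state0.cpt, ...).
import shlex
from typing import List


def remd_restart_cmd(n_replicas: int, ranks_per_replica: int, replex: int,
                     append: bool = False) -> List[str]:
    """Return the argument list for restarting all replicas from their checkpoints."""
    cmd = ["mpirun", "-np", str(n_replicas * ranks_per_replica),
           "mdrun_mpi", "-multi", str(n_replicas), "-replex", str(replex),
           "-s", "topol.tpr", "-cpi", "state.cpt"]
    if not append:
        # Start new output files instead of appending to md0.log etc.;
        # appending is what requires the old files to match the checkpoint.
        cmd.append("-noappend")
    return cmd


if __name__ == "__main__":
    print(shlex.join(remd_restart_cmd(6, 32, 1000)))
```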
> >> >
> >> > Mark
> >> >
> >> > On Thu, 25 Jun 2015 16:23 leila salimi <leilasalimi at gmail.com> wrote:
> >> >
> >> > > Thanks very much. OK, I will check again; it seems that they are at the
> >> > > same step!
> >> > > The only thing that comes to mind is that I used a different number of
> >> > > CPUs when I ran a few extra steps for some replicas, and then I went
> >> > > back to the original number of CPUs.
> >> > >
> >> > > Also, I got this error when I updated some of the state.cpt files:
> >> > > Fatal error:
> >> > > Checksum wrong for 'md0.log'. The file has been replaced or its
> >> > > contents have been modified. Cannot do appending because of this
> >> > > condition.
> >> > > For more information and tips for troubleshooting, please check the
> >> > > GROMACS website at http://www.gromacs.org/Documentation/Errors
> >> > >
> >> > > And also this:
> >> > >
> >> > >  #nodes mismatch,
> >> > >     current program: 2
> >> > >     checkpoint file: 128
> >> > >
> >> > >   #PME-nodes mismatch,
> >> > >     current program: -1
> >> > >     checkpoint file: 32
> >> > >
> >> > > I hope I can figure out this problem; otherwise I will have to run it
> >> > > from the beginning!
> >> > > Thanks!
> >> > >
> >> > > Leila
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Jun 25, 2015 at 4:15 PM, Mark Abraham <mark.j.abraham at gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Hi,
> >> > > >
> >> > > > I can't tell either. Please run gmxcheck on all your input files to
> >> > > > check that the simulation part, time and step number are all what you
> >> > > > think they are (and that they match across the simulations), and try
> >> > > > again.
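After reading the simulation part, step, and time of each stateN.cpt (for example, from gmxcheck output), you can automate the comparison Mark asks for. This is a minimal sketch. The function name and the example values are hypothetical. The values have to be copied in by hand, because the sketch does not parse checkpoint files itself.

```python
# Sketch: find replicas whose checkpoint (part, step, time) differs from the rest.
# Assumption: the values were copied by hand from gmxcheck output for each
# stateN.cpt; the numbers below are hypothetical placeholders.
from collections import Counter
from typing import Dict, Tuple

CptInfo = Tuple[int, int, str]  # (simulation part, step, time in ps as printed)


def mismatched_replicas(cpts: Dict[int, CptInfo]) -> Dict[int, CptInfo]:
    """Return the replicas whose checkpoint info differs from the most common value."""
    if not cpts:
        return {}
    reference, _count = Counter(cpts.values()).most_common(1)[0]
    return {rep: info for rep, info in cpts.items() if info != reference}


if __name__ == "__main__":
    example = {
        0: (3, 1000, "2.000"),
        1: (3, 1000, "2.000"),
        2: (3, 1031, "2.062"),  # hypothetical replica that ran ahead
    }
    print(mismatched_replicas(example))
```

Times are kept as the printed strings so that the comparison is exact rather than subject to floating-point rounding.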
> >> > > >
> >> > > > Mark
> >> > > >
> >> > > > On Thu, Jun 25, 2015 at 4:12 PM leila salimi <leilasalimi at gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > Dear Mark,
> >> > > > >
> >> > > > > When I tried again with the newly updated state.cpt files, I got this
> >> > > > > error:
> >> > > > >
> >> > > > > Abort(1) on node 896 (rank 896 in comm 1140850688): Fatal error in MPI_Allreduce: Message truncated, error stack:
> >> > > > > MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7ffc783af760, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000002) failed
> >> > > > > MPIR_Allreduce_impl(769).:
> >> > > > > MPIR_Allreduce_intra(419):
> >> > > > > MPIC_Sendrecv(467).......:
> >> > > > > MPIDI_Buffer_copy(73)....: Message truncated; 64 bytes received but buffer size is 32
> >> > > > > Abort(1) on node 768 (rank 768 in comm 1140850688): Fatal error in MPI_Allreduce: Message truncated, error stack:
> >> > > > > MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7ffdba5176a0, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000002) failed
> >> > > > > MPIR_Allreduce_impl(769).:
> >> > > > > MPIR_Allreduce_intra(419):
> >> > > > > MPIC_Sendrecv(467).......:
> >> > > > > MPIDI_Buffer_copy(73)....: Message truncated; 64 bytes received but buffer size is 32
> >> > > > > ERROR: 0031-300  Forcing all remote tasks to exit due to exit code 1 in task 896
> >> > > > >
> >> > > > > I really don't know what the problem is!
> >> > > > >
> >> > > > > Regards,
> >> > > > > Leila
> >> > > > >
> >> > > > >
> >> > > > > On Thu, Jun 18, 2015 at 12:00 AM, leila salimi <leilasalimi at gmail.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > I understand what you meant: I ran only a few steps for the other
> >> > > > > > replicas and then continued with all the replicas.
> >> > > > > > I hope everything goes well.
> >> > > > > >
> >> > > > > > Thanks very much.
> >> > > > > >
> >> > > > > > On Wed, Jun 17, 2015 at 11:43 PM, leila salimi <leilasalimi at gmail.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > >> Thanks, Mark, for your suggestion.
> >> > > > > >> Actually, I don't understand the two new state6.cpt and state7.cpt
> >> > > > > >> files, because the time they show is 127670.062!
> >> > > > > >> That is strange, because my time step is 2 fs and I saved the output
> >> > > > > >> every 250 steps, i.e. every 500 fs. I would expect the time to be
> >> > > > > >> something like 127670.000 or 127670.500.
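Here is a quick arithmetic check of the reported time, assuming tinit = 0 and init-step = 0, so that time = step * dt. Checkpoints are written on the -cpt wall-clock schedule (and normally at the last step of a run), not on the output schedule. So a checkpoint time only has to fall on the 2 fs grid. The sketch works in integer femtoseconds to avoid floating-point rounding.

```python
# Sketch: is a checkpoint time consistent with the time step and the output
# interval? dt = 2 fs and output every 250 steps come from the message above;
# 127670.062 ps is the time reported for state6.cpt / state7.cpt.
# Assumption: tinit = 0 and init-step = 0, so time = step * dt.
DT_FS = 2
OUTPUT_EVERY_STEPS = 250


def ps_to_fs(t_ps: str) -> int:
    """Convert a time printed in ps with up to 3 decimals to integer fs."""
    whole, _, frac = t_ps.partition(".")
    return int(whole) * 1000 + int((frac + "000")[:3])


t_fs = ps_to_fs("127670.062")
step, rem = divmod(t_fs, DT_FS)
print("step", step, "remainder", rem)                      # step 63835031 remainder 0
print("on output grid:", step % OUTPUT_EVERY_STEPS == 0)  # on output grid: False
```

Under these assumptions, the time is exactly step 63835031, consistent with the 2 fs step but not with the 250-step output interval.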
> >> > > > > >>
> >> > > > > >> By the way, do you mean that with mdrun_mpi ... -nsteps ... I can
> >> > > > > >> advance the old state.cpt files to the steps that I need?
> >> > > > > >>
> >> > > > > >> Regards,
> >> > > > > >> Leila
> >> > > > > >>
> >> > > > > >> On Wed, Jun 17, 2015 at 11:22 PM, Mark Abraham <mark.j.abraham at gmail.com>
> >> > > > > >> wrote:
> >> > > > > >>
> >> > > > > >>> Hi,
> >> > > > > >>>
> >> > > > > >>> That's all extremely strange. Given that you aren't going to exchange
> >> > > > > >>> in that short period of time, you can probably do some arithmetic and
> >> > > > > >>> work out how many steps you'd need to advance whichever set of files
> >> > > > > >>> is behind the other. Then mdrun_mpi ... -nsteps y can write a set of
> >> > > > > >>> checkpoint files that will all be at the same time!
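Mark's arithmetic can be written as a minimal sketch. The number of extra steps y for the lagging replicas is the time gap divided by dt, and it must come out as a whole number. The two times in the example are hypothetical inputs in femtoseconds. How -nsteps is counted on a checkpoint restart may differ between GROMACS versions, so check mdrun -h for your version before relying on it. As Mark notes, this shortcut is only reasonable because no exchanges would have been attempted within such a short gap anyway.

```python
# Sketch: extra steps needed so the lagging replicas catch up to the others.
# Times are integer femtoseconds; dt = 2 fs is taken from this thread, while
# the example times are hypothetical.
def steps_to_advance(t_behind_fs: int, t_ahead_fs: int, dt_fs: int = 2) -> int:
    """Number of MD steps that moves a replica from t_behind_fs to t_ahead_fs."""
    gap_fs = t_ahead_fs - t_behind_fs
    if gap_fs < 0:
        raise ValueError("the 'behind' checkpoint is actually ahead")
    steps, remainder = divmod(gap_fs, dt_fs)
    if remainder:
        raise ValueError("time gap is not a whole number of steps")
    return steps


if __name__ == "__main__":
    # Hypothetical: replicas at 127670.000 ps that must reach 127670.062 ps.
    print(steps_to_advance(127_670_000, 127_670_062))  # 31
```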
> >> > > > > >>>
> >> > > > > >>> Mark
> >> > > > > >>>
> >> > > > > >>> On Wed, Jun 17, 2015 at 10:18 PM leila salimi <leilasalimi at gmail.com>
> >> > > > > >>> wrote:
> >> > > > > >>>
> >> > > > > >>> > Hi Mark,
> >> > > > > >>> >
> >> > > > > >>> > Thanks very much. Unfortunately, state6.cpt, state6_prev.cpt,
> >> > > > > >>> > state7.cpt and state7_prev.cpt were all updated, and their times
> >> > > > > >>> > differ from those of the other replicas' files (including the
> >> > > > > >>> > *_prev.cpt files)!
> >> > > > > >>> >
> >> > > > > >>> > I am thinking that maybe I can use init-step in the mdp file and
> >> > > > > >>> > start from the time that I have, because all the trr files have the
> >> > > > > >>> > same time (I checked with gmxcheck). But I am not sure that I will
> >> > > > > >>> > get correct results!
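For reference, the mdp documentation defines the time at step i of a run as shown below. So init-step, together with tinit, only changes how steps and times are labelled. On its own it does not reproduce the rest of the state that a checkpoint carries. The worked value assumes tinit = 0 and dt = 0.002 ps.

```latex
t_i = t_{\mathrm{init}} + \Delta t \,(\mathrm{init\text{-}step} + i)

\mathrm{init\text{-}step} = \frac{127670.000\ \mathrm{ps}}{0.002\ \mathrm{ps}} = 63\,835\,000
\quad\Longrightarrow\quad t_0 = 127670.000\ \mathrm{ps}
```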
> >> > > > > >>> > What confuses me is that, with the Note mentioned earlier, only two
> >> > > > > >>> > replicas were running, and their state files changed while the
> >> > > > > >>> > others did not!
> >> > > > > >>> >
> >> > > > > >>> > Regards,
> >> > > > > >>> > Leila
> >> > > > > >>> > --
> >> > > > > >>> > Gromacs Users mailing list
> >> > > > > >>> >
> >> > > > > >>> > * Please search the archive at
> >> > > > > >>> >
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List
> >> > > before
> >> > > > > >>> > posting!
> >> > > > > >>> >
> >> > > > > >>> > * Can't post? Read
> >> > http://www.gromacs.org/Support/Mailing_Lists
> >> > > > > >>> >
> >> > > > > >>> > * For (un)subscribe requests visit
> >> > > > > >>> >
> >> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> >> > > > > or
> >> > > > > >>> > send a mail to gmx-users-request at gromacs.org.
> >> > > > > >>>
> >> > > > > >>
> >> > > > > >>
> >> > > > > >
> >> > > >
> >> >
> >>
> >
> >

