[gmx-users] problem to restart REMD
Smith, Micholas D.
smithmd at ornl.gov
Fri Jun 26 21:16:23 CEST 2015
Leila, your error is interesting, as I have had a very similar MPI_Allreduce error when trying to restart a large-scale REMD. The first few times the system restarted just fine, but at some point it fails.
Out of curiosity, if you re-run from the beginning, does it work?
-Micholas
===================
Micholas Dean Smith, PhD.
Post-doctoral Research Associate
University of Tennessee/Oak Ridge National Laboratory
Center for Molecular Biophysics
________________________________________
From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of leila salimi <leilasalimi at gmail.com>
Sent: Friday, June 26, 2015 1:30 PM
To: gmx-users at gromacs.org
Subject: Re: [gmx-users] problem to restart REMD
Actually, I have checked several times: the steps in all the state.cpt files
are the same.
When I try to restart, it runs for only a few steps (it took only 3
minutes) and then stops with these lines in the error file:
Abort(1) on node 12 (rank 12 in comm 1140850688): Fatal error in
MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE,
rbuf=0x7fff8606aa00, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000001) failed
MPIR_Allreduce_impl(769).:
MPIR_Allreduce_intra(270):
MPIR_Bcast_impl(1462)....:
MPIR_Bcast(1486).........:
MPIR_Bcast_intra(1295)...:
MPIR_Bcast_binomial(252).: message sizes do not match across processes in
the collective routine: Received 64 but expected 32
ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in
task 12
So I guess the problem is related to MPI, but I don't understand why,
because my other simulations are running fine.
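Since the MPI error suggests the replicas disagree at restart, one quick way to confirm that every replica's checkpoint really is at the same step is to dump each one and compare. This is only a sketch: the replica count (16) and file names are illustrative, and it assumes gmxdump from GROMACS 4.x (in 5.x the equivalent is gmx dump -cp) is on the PATH.

```shell
# Print the step recorded in each replica's checkpoint so any mismatch
# stands out at a glance. Replica count and file names are illustrative.
for i in $(seq 0 15); do
    step=$(gmxdump -cp "state$i.cpt" 2>/dev/null | grep -m1 'step')
    echo "state$i.cpt: $step"
done
```

If any replica reports a different step, that replica's checkpoint pair is the one to advance or roll back before attempting the restart.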
Thanks for your suggestion.
Leila
On Fri, Jun 26, 2015 at 7:10 PM, Mark Abraham <mark.j.abraham at gmail.com>
wrote:
> Hi,
>
> I can't tell what you've done such that md0.log doesn't match, but that's
> why I suggested you make a backup. You also don't have to use appending;
> that's just a convenience. The advice about the node-count mismatch doesn't
> matter here... Use your judgement!
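For the md0.log checksum complaint quoted below, one common pattern is to back up the per-replica logs and restart without appending, so mdrun writes new numbered output files instead of verifying the old ones. A sketch only: the mpirun line is illustrative and the -np/-multi values are placeholders for this system.

```shell
# Back up the per-replica logs before restarting, then restart without
# appending so mdrun does not checksum md0.log against the checkpoint.
for f in md*.log; do cp -p "$f" "$f.bak"; done

# Illustrative restart; rank and replica counts are placeholders:
# mpirun -np 128 mdrun_mpi -multi 16 -replex 250 -cpi state.cpt -noappend
```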
>
> Mark
>
> On Thu, 25 Jun 2015 16:23 leila salimi <leilasalimi at gmail.com> wrote:
>
> > Thanks very much. OK, I will check again; it seems that they are at the
> > same step!
> > The only thing that comes to mind is that I used a different number of
> > CPUs when I tried to advance a few steps for some replicas, and then
> > switched back to the original number of CPUs.
> >
> > Also, I got this error when I updated some of the state.cpt files:
> > Fatal error:
> > Checksum wrong for 'md0.log'. The file has been replaced or its contents
> > have been modified. Cannot do appending because of this condition.
> > For more information and tips for troubleshooting, please check the
> GROMACS
> > website at http://www.gromacs.org/Documentation/Errors
> >
> > and also this!
> >
> > #nodes mismatch,
> > current program: 2
> > checkpoint file: 128
> >
> > #PME-nodes mismatch,
> > current program: -1
> > checkpoint file: 32
> >
> > I hope to figure out this problem; otherwise I have to rerun it from the
> > beginning!
> > Thanks!
> >
> > Leila
> >
> >
> >
> > On Thu, Jun 25, 2015 at 4:15 PM, Mark Abraham <mark.j.abraham at gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I can't tell either. Please run gmxcheck on all your input files to
> > > check that the simulation part, time, and step number are all what you
> > > think they are (and that they match across the simulations), and try
> > > again.
> > >
> > > Mark
> > >
> > > On Thu, Jun 25, 2015 at 4:12 PM leila salimi <leilasalimi at gmail.com>
> > > wrote:
> > >
> > > > Dear Mark,
> > > >
> > > > When I tried again with the newly updated state.cpt files, I got
> > > > this error.
> > > >
> > > > Abort(1) on node 896 (rank 896 in comm 1140850688): Fatal error in
> > > > MPI_Allreduce: Message truncated, error stack:
> > > > MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE,
> > > > rbuf=0x7ffc783af760, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000002)
> > > failed
> > > > MPIR_Allreduce_impl(769).:
> > > > MPIR_Allreduce_intra(419):
> > > > MPIC_Sendrecv(467).......:
> > > > MPIDI_Buffer_copy(73)....: Message truncated; 64 bytes received but
> > > buffer
> > > > size is 32
> > > > Abort(1) on node 768 (rank 768 in comm 1140850688): Fatal error in
> > > > MPI_Allreduce: Message truncated, error stack:
> > > > MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE,
> > > > rbuf=0x7ffdba5176a0, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000002)
> > > failed
> > > > MPIR_Allreduce_impl(769).:
> > > > MPIR_Allreduce_intra(419):
> > > > MPIC_Sendrecv(467).......:
> > > > MPIDI_Buffer_copy(73)....: Message truncated; 64 bytes received but
> > > buffer
> > > > size is 32
> > > > ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1
> in
> > > > task 896
> > > >
> > > > Actually I don't know what the problem is!
> > > >
> > > > Regards,
> > > > Leila
> > > >
> > > >
> > > > On Thu, Jun 18, 2015 at 12:00 AM, leila salimi <
> leilasalimi at gmail.com>
> > > > wrote:
> > > >
> > > > > I understand what you meant; I ran only a few steps for the other
> > > > > replicas and then continued with all the replicas together.
> > > > > I hope everything is going well.
> > > > >
> > > > > Thanks very much.
> > > > >
> > > > > On Wed, Jun 17, 2015 at 11:43 PM, leila salimi <
> > leilasalimi at gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Thanks Mark for your suggestion.
> > > > >> Actually I don't understand the two new files, state6.cpt and
> > > > >> state7.cpt, because the time they show is 127670.062!
> > > > >> That is strange, because my time step is 2 fs and I saved the
> > > > >> output every 250 steps, i.e. every 500 fs. I would expect the time
> > > > >> to be something like 127670.000 or 127670.500.
> > > > >>
> > > > >> By the way, do you mean that with mdrun_mpi ... -nsteps ... I can
> > > > >> advance the old state.cpt files by the number of steps that I
> > > > >> need?
> > > > >>
> > > > >> Regards,
> > > > >> Leila
> > > > >>
> > > > >> On Wed, Jun 17, 2015 at 11:22 PM, Mark Abraham <
> > > > mark.j.abraham at gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>> Hi,
> > > > >>>
> > > > >>> That's all extremely strange. Given that you aren't going to
> > > > >>> exchange in that short period of time, you can probably do some
> > > > >>> arithmetic and work out how many steps you'd need to advance
> > > > >>> whichever set of files is behind the other. Then mdrun_mpi ...
> > > > >>> -nsteps y can write a set of checkpoint files that will all be
> > > > >>> at the same time!
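The arithmetic in question can be sketched as follows. The 2 fs time step and the odd checkpoint time 127670.062 ps come from this thread; the target time 127670.500 ps (the next 500 fs multiple) is an illustrative choice.

```shell
# Steps needed to advance the lagging replicas to a common time.
t_target=127670.500   # ps, time the lagging replicas should reach
t_current=127670.062  # ps, time in the odd state6.cpt/state7.cpt
dt=0.002              # ps per step (2 fs)

# steps = (t_target - t_current) / dt, rounded to the nearest integer
nsteps=$(awk -v a="$t_target" -v b="$t_current" -v d="$dt" \
         'BEGIN { printf "%d", (a - b) / d + 0.5 }')
echo "$nsteps"   # 219

# then, e.g.: mpirun ... mdrun_mpi ... -cpi state.cpt -nsteps "$nsteps"
```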
> > > > >>>
> > > > >>> Mark
> > > > >>>
> > > > >>> On Wed, Jun 17, 2015 at 10:18 PM leila salimi <
> > leilasalimi at gmail.com
> > > >
> > > > >>> wrote:
> > > > >>>
> > > > >>> > Hi Mark,
> > > > >>> >
> > > > >>> > Thanks very much. Unfortunately both state6.cpt/state6_prev.cpt
> > > > >>> > and state7.cpt/state7_prev.cpt were updated, and their times
> > > > >>> > differ from the other replicas' files (including the *_prev.cpt
> > > > >>> > files)!
> > > > >>> >
> > > > >>> > I am thinking maybe I can use init-step in the mdp file and
> > > > >>> > start from the time that I have, because all the trr files have
> > > > >>> > the same time (I checked with gmxcheck). But I am not sure that
> > > > >>> > I will get correct results!
> > > > >>> > Actually, what confuses me is that, with the mentioned Note,
> > > > >>> > only two replicas kept running, and only their state files
> > > > >>> > changed while the others did not!
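The init-step idea above would look something like the fragment below. This is only illustrative: the step number is just 127670.500 ps divided by the 0.002 ps time step from this thread, and it assumes tinit is left at 0 so that simulation time = tinit + init-step * dt.

```
; illustrative continuation fragment; values derived from this thread
dt        = 0.002      ; ps (2 fs time step)
tinit     = 0          ; ps; simulation time = tinit + init-step * dt
init-step = 63835250   ; = 127670.500 ps / 0.002 ps
```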
> > > > >>> >
> > > > >>> > regards,
> > > > >>> > Leila
> > > > >>> > --
> > > > >>> > Gromacs Users mailing list
> > > > >>> >
> > > > >>> > * Please search the archive at
> > > > >>> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List
> > before
> > > > >>> > posting!
> > > > >>> >
> > > > >>> > * Can't post? Read
> http://www.gromacs.org/Support/Mailing_Lists
> > > > >>> >
> > > > >>> > * For (un)subscribe requests visit
> > > > >>> >
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> > > > or
> > > > >>> > send a mail to gmx-users-request at gromacs.org.