[gmx-users] problem to restart REMD

leila salimi leilasalimi at gmail.com
Fri Jun 26 21:49:55 CEST 2015


Dear Micholas,
I agree with you! I am trying to find out what is wrong with restarting this
system!
I am sure that if I start from the beginning it will stop and get stuck at this
step as well!

I checked everything and it seems fine, but REMD is not working!
Now I am trying to run only the first 5 replicas to see whether they get past
this step or not.
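
Roughly, the command for such a 5-replica test would look like this (just a
sketch; the core count, exchange interval and file names are placeholders, not
my exact settings):

mpirun -np 80 mdrun_mpi -multi 5 -replex 250 \
       -s remd.tpr -cpi state.cpt

With -multi 5, mdrun appends the replica index to the names, so it reads
remd0.tpr ... remd4.tpr, restarts from state0.cpt ... state4.cpt and writes
logs md0.log ... md4.log by default.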

I will let you know what I find.

Leila

On Fri, Jun 26, 2015 at 9:16 PM, Smith, Micholas D. <smithmd at ornl.gov>
wrote:

> Leila, your error is interesting, as I have had a very similar
> MPI_Allreduce error when I try to restart a large-scale REMD. The first few
> times the system restarted just fine, but at some point it fails.
>
> Out of curiosity, if you try to re-run from the beginning, does it work?
>
> -Micholas
>
>
> ===================
> Micholas Dean Smith, PhD.
> Post-doctoral Research Associate
> University of Tennessee/Oak Ridge National Laboratory
> Center for Molecular Biophysics
>
> ________________________________________
> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
> gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of leila
> salimi <leilasalimi at gmail.com>
> Sent: Friday, June 26, 2015 1:30 PM
> To: gmx-users at gromacs.org
> Subject: Re: [gmx-users] problem to restart REMD
>
> Actually, I checked several times: the steps in all the state.cpt files are
> the same.
> When I try to restart it, it runs for only a few steps (it took only 3
> minutes) and then it stops with these lines in the error file:
>
> Abort(1) on node 12 (rank 12 in comm 1140850688): Fatal error in
> MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE,
> rbuf=0x7fff8606aa00, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000001) failed
> MPIR_Allreduce_impl(769).:
> MPIR_Allreduce_intra(270):
> MPIR_Bcast_impl(1462)....:
> MPIR_Bcast(1486).........:
> MPIR_Bcast_intra(1295)...:
> MPIR_Bcast_binomial(252).: message sizes do not match across processes in
> the collective routine: Received 64 but expected 32
> ERROR: 0031-300  Forcing all remote tasks to exit due to exit code 1 in
> task 12
>
> So I guess the problem is related to MPI, but I don't understand why, because
> my other simulation is running fine.
>
> Thanks for your suggestion.
> Leila
>
> On Fri, Jun 26, 2015 at 7:10 PM, Mark Abraham <mark.j.abraham at gmail.com>
> wrote:
>
> > Hi,
> >
> > I can't tell what you've done so that md0.log doesn't match, but that's why
> > I suggested you make a backup. You also don't have to use appending;
> > that's just for convenience. The advice about the node count mismatch doesn't
> > matter here... Use your judgement!
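> >
> > For example, if mdrun will not append because of the md0.log checksum, a
> > restart of this general shape writes new, numbered output files instead of
> > appending (the core/replica counts, exchange interval and .tpr name are
> > placeholders):
> >
> > mpirun -np <Ncores> mdrun_mpi -multi <Nreplicas> -replex 250 \
> >        -s remd.tpr -cpi state.cpt -noappend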
> >
> > Mark
> >
> > On Thu, 25 Jun 2015 16:23 leila salimi <leilasalimi at gmail.com> wrote:
> >
> > > Thanks very much. OK, I will check again; it seems that they are at the
> > > same step!
> > > The only thing that comes to my mind is that I used a different number of
> > > CPUs when I tried to advance a few steps for some replicas, and then I
> > > went back to the original number of CPUs.
> > >
> > > I also got this error when I updated some of the state.cpt files:
> > >
> > > Fatal error:
> > > Checksum wrong for 'md0.log'. The file has been replaced or its contents
> > > have been modified. Cannot do appending because of this condition.
> > > For more information and tips for troubleshooting, please check the
> > > GROMACS website at http://www.gromacs.org/Documentation/Errors
> > >
> > > and also this!
> > >
> > >  #nodes mismatch,
> > >     current program: 2
> > >     checkpoint file: 128
> > >
> > >   #PME-nodes mismatch,
> > >     current program: -1
> > >     checkpoint file: 32
> > >
> > > I hope to figure out this problem; otherwise I will have to run it from
> > > the beginning!
> > > Thanks!
> > >
> > > Leila
> > >
> > >
> > >
> > > On Thu, Jun 25, 2015 at 4:15 PM, Mark Abraham <
> mark.j.abraham at gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I can't tell either. Please run gmxcheck on all your input files to check
> > > > that the simulation part, time and step number are all what you think they
> > > > are (and that they match across the simulations), and try again.
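> > > >
> > > > For example, something along these lines prints the time recorded in each
> > > > replica's checkpoint and trajectory (a sketch only; the replica count and
> > > > file names are guesses, and on GROMACS 5.x the equivalents are gmx check
> > > > and gmx dump -cp):
> > > >
> > > > for i in $(seq 0 63); do
> > > >     echo "=== replica $i ==="
> > > >     gmxcheck -f state${i}.cpt   # time recorded in the checkpoint
> > > >     gmxcheck -f traj${i}.trr    # last frame written to the trajectory
> > > > done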
> > > >
> > > > Mark
> > > >
> > > > On Thu, Jun 25, 2015 at 4:12 PM leila salimi <leilasalimi at gmail.com>
> > > > wrote:
> > > >
> > > > > Dear Mark,
> > > > >
> > > > > When I tried with the newly updated state.cpt files, I got this error:
> > > > >
> > > > > Abort(1) on node 896 (rank 896 in comm 1140850688): Fatal error in
> > > > > MPI_Allreduce: Message truncated, error stack:
> > > > > MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE,
> > > > > rbuf=0x7ffc783af760, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000002)
> > > > failed
> > > > > MPIR_Allreduce_impl(769).:
> > > > > MPIR_Allreduce_intra(419):
> > > > > MPIC_Sendrecv(467).......:
> > > > > MPIDI_Buffer_copy(73)....: Message truncated; 64 bytes received but
> > > > buffer
> > > > > size is 32
> > > > > Abort(1) on node 768 (rank 768 in comm 1140850688): Fatal error in
> > > > > MPI_Allreduce: Message truncated, error stack:
> > > > > MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE,
> > > > > rbuf=0x7ffdba5176a0, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000002)
> > > > failed
> > > > > MPIR_Allreduce_impl(769).:
> > > > > MPIR_Allreduce_intra(419):
> > > > > MPIC_Sendrecv(467).......:
> > > > > MPIDI_Buffer_copy(73)....: Message truncated; 64 bytes received but
> > > > buffer
> > > > > size is 32
> > > > > ERROR: 0031-300  Forcing all remote tasks to exit due to exit code
> 1
> > in
> > > > > task 896
> > > > > (these lines are from the error file job.err.1011016.out)
> > > > >
> > > > > Actually I don't know what the problem is!
> > > > >
> > > > > Regards,
> > > > > Leila
> > > > >
> > > > >
> > > > > On Thu, Jun 18, 2015 at 12:00 AM, leila salimi <
> > leilasalimi at gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I understand what you meant; I will run only a few steps for the other
> > > > > > replicas and then continue with all the replicas.
> > > > > > I hope everything goes well.
> > > > > >
> > > > > > Thanks very much.
> > > > > >
> > > > > > On Wed, Jun 17, 2015 at 11:43 PM, leila salimi <
> > > leilasalimi at gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> Thanks Mark for your suggestion.
> > > > > >> Actually I don't understand the two new state6.cpt and state7.cpt
> > > > > >> files, because the time they show is 127670.062!
> > > > > >> That is strange, because my time step is 2 fs and I save output every
> > > > > >> 250 steps, i.e. every 500 fs, so I would expect a time like 127670.000
> > > > > >> or 127670.500.
> > > > > >>
> > > > > >> By the way, do you mean that with mdrun_mpi ... -nsteps ... I can
> > > > > >> advance the old state.cpt files by the steps that I need?
> > > > > >>
> > > > > >> Regards,
> > > > > >> Leila
> > > > > >>
> > > > > >> On Wed, Jun 17, 2015 at 11:22 PM, Mark Abraham <
> > > > > mark.j.abraham at gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi,
> > > > > >>>
> > > > > >>> That's all extremely strange. Given that you aren't going to exchange
> > > > > >>> in that short period of time, you can probably do some arithmetic and
> > > > > >>> work out how many steps you'd need to advance whichever set of files
> > > > > >>> is behind the other. Then mdrun_mpi ... -nsteps y can write a set of
> > > > > >>> checkpoint files that will be all at the same time!
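> > > > > >>>
> > > > > >>> A purely hypothetical illustration of that arithmetic (the times here
> > > > > >>> are made up, not taken from your files): if two replicas are at
> > > > > >>> 127660.000 ps while the rest are at 127670.000 ps, then with dt =
> > > > > >>> 0.002 ps the lagging pair is (127670.000 - 127660.000) / 0.002 = 5000
> > > > > >>> steps behind, and a short run on just such a replica, e.g.
> > > > > >>>
> > > > > >>> mpirun -np 16 mdrun_mpi -s remd6.tpr -cpi state6.cpt -cpo state6.cpt \
> > > > > >>>        -g md6.log -nsteps 5000
> > > > > >>>
> > > > > >>> (core count and file names are placeholders; the other output files
> > > > > >>> would need matching per-replica names too) could bring its checkpoint
> > > > > >>> level with the others. Check the final step reported in the log, since
> > > > > >>> -nsteps overrides the step count from the .tpr.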
> > > > > >>>
> > > > > >>> Mark
> > > > > >>>
> > > > > >>> On Wed, Jun 17, 2015 at 10:18 PM leila salimi <
> > > leilasalimi at gmail.com
> > > > >
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>> > Hi Mark,
> > > > > >>> >
> > > > > >>> > Thanks very much. Unfortunately both state6.cpt and state7.cpt (and
> > > > > >>> > also state6_prev.cpt and state7_prev.cpt) were updated, and their
> > > > > >>> > times are different from the other replicas' files (including the
> > > > > >>> > *_prev.cpt ones)!
> > > > > >>> >
> > > > > >>> > I am thinking maybe I can use init-step in the .mdp file and start
> > > > > >>> > from the time that I have, because all the .trr files have the same
> > > > > >>> > time (I checked with gmxcheck), but I am not sure that I will get
> > > > > >>> > correct results!
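> > > > > >>> >
> > > > > >>> > As a sketch of what that .mdp continuation could look like (assuming
> > > > > >>> > dt = 0.002 ps and a common trajectory time of 127670.000 ps, which
> > > > > >>> > corresponds to step 127670.000 / 0.002 = 63835000; the values are
> > > > > >>> > illustrative only):
> > > > > >>> >
> > > > > >>> > dt           = 0.002      ; 2 fs
> > > > > >>> > init-step    = 63835000   ; so that t = tinit + dt*init-step = 127670 ps
> > > > > >>> > nsteps       = ...        ; remaining steps to run
> > > > > >>> > gen-vel      = no         ; do not regenerate velocities
> > > > > >>> > continuation = yes        ; do not constrain the start configuration
> > > > > >>> >
> > > > > >>> > (grompp -t with the .trr or .cpt would then supply the coordinates
> > > > > >>> > and velocities.)
> > > > > >>> >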
> > > > > >>> > Actually I am confused that, with the Note I mentioned, only two
> > > > > >>> > replicas were running and their state files changed while the others
> > > > > >>> > did not!
> > > > > >>> >
> > > > > >>> > Regards,
> > > > > >>> > Leila