[gmx-users] problem to restart REMD

leila salimi leilasalimi at gmail.com
Sun Jun 28 13:53:26 CEST 2015


Dear Mark,

I have a question about the updated state files. I am running the
simulation with 6 replicas now, and the state*.cpt files have not been
updated for 1.5 ns, which seems strange to me!
I would like to know how often the restart files are updated.
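I see that mdrun has a -cpt option for the checkpoint interval in
minutes; if checkpoint writing is wall-clock based, a minimal sketch of
forcing more frequent writes (replica count and file names hypothetical)
would be:

  # ask mdrun to write state*.cpt every 5 minutes of wall time (default is 15)
  mpirun -np 6 mdrun_mpi -multi 6 -s remd.tpr -cpi state.cpt -cpt 5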

Regards,
Leila

On Fri, Jun 26, 2015 at 9:49 PM, leila salimi <leilasalimi at gmail.com> wrote:

> Dear Micholas,
> I agree with you! I am trying to find out what is wrong with restarting
> this system!
> I am sure that if I start from the beginning it will stop at this step
> and get stuck!
>
> I checked everything and it seems fine, but REMD is not working!
> Now I am trying to run only the first 5 replicas, to see whether it
> passes the step or not.
>
> I will let you know what I find.
>
> Leila
>
> On Fri, Jun 26, 2015 at 9:16 PM, Smith, Micholas D. <smithmd at ornl.gov>
> wrote:
>
>> Leila, your error is interesting, as I have had a very similar
>> MPI_Allreduce error when trying to restart a large-scale REMD. The
>> first few times the system restarted just fine, but at some point it
>> fails.
>>
>> Out of curiosity, if we try to re-run from the beginning, does it work?
>>
>> -Micholas
>>
>>
>> ===================
>> Micholas Dean Smith, PhD.
>> Post-doctoral Research Associate
>> University of Tennessee/Oak Ridge National Laboratory
>> Center for Molecular Biophysics
>>
>> ________________________________________
>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <
>> gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of leila
>> salimi <leilasalimi at gmail.com>
>> Sent: Friday, June 26, 2015 1:30 PM
>> To: gmx-users at gromacs.org
>> Subject: Re: [gmx-users] problem to restart REMD
>>
>> Actually, I have checked several times: the steps in all the state.cpt
>> files are the same.
>> When I try to restart, it runs for only a few steps (it took only 3
>> minutes) and then stops with these lines in the error file:
>>
>> Abort(1) on node 12 (rank 12 in comm 1140850688): Fatal error in
>> MPI_Allreduce: Other MPI error, error stack:
>> MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE,
>> rbuf=0x7fff8606aa00, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000001) failed
>> MPIR_Allreduce_impl(769).:
>> MPIR_Allreduce_intra(270):
>> MPIR_Bcast_impl(1462)....:
>> MPIR_Bcast(1486).........:
>> MPIR_Bcast_intra(1295)...:
>> MPIR_Bcast_binomial(252).: message sizes do not match across processes in
>> the collective routine: Received 64 but expected 32
>> ERROR: 0031-300  Forcing all remote tasks to exit due to exit code 1 in
>> task 12
>>
>> So I guess the problem is related to MPI, but I don't understand why,
>> because my other simulations are running well.
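>>
>> One thing I want to rule out is restarting with anything other than
>> exactly the replica and rank layout that wrote the checkpoints; a
>> minimal sketch, with hypothetical counts, exchange interval, and file
>> names:
>>
>>   # 6 replicas x 128 ranks each, matching the run that wrote state*.cpt
>>   mpirun -np 768 mdrun_mpi -multi 6 -replex 500 -s remd.tpr -cpi state.cpt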
>>
>> Thanks for your suggestion.
>> Leila
>>
>> On Fri, Jun 26, 2015 at 7:10 PM, Mark Abraham <mark.j.abraham at gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I can't tell what you've done so that md0.log doesn't match, but
>> > that's why I suggested you make a backup. You also don't have to use
>> > appending; that's just for convenience. The advice about the node
>> > count mismatch doesn't matter here... Use your judgement!
>> >
>> > Mark
>> >
>> > On Thu, 25 Jun 2015 16:23 leila salimi <leilasalimi at gmail.com> wrote:
>> >
>> > > Thanks very much. OK, I will check again; it seems that they are at
>> > > the same step!
>> > > The only thing that comes to my mind is that I used a different
>> > > number of CPUs when I tried to advance a few steps for some
>> > > replicas, and then went back to the original number of CPUs.
>> > >
>> > > I also got this error when I was updating some of the state.cpt
>> > > files:
>> > >
>> > > Fatal error:
>> > > Checksum wrong for 'md0.log'. The file has been replaced or its
>> > > contents have been modified. Cannot do appending because of this
>> > > condition.
>> > > For more information and tips for troubleshooting, please check the
>> > > GROMACS website at http://www.gromacs.org/Documentation/Errors
>> > >
>> > > and also this:
>> > >
>> > >   #nodes mismatch,
>> > >     current program: 2
>> > >     checkpoint file: 128
>> > >
>> > >   #PME-nodes mismatch,
>> > >     current program: -1
>> > >     checkpoint file: 32
>> > >
>> > > I hope to figure out this problem; otherwise I will have to run it
>> > > from the beginning!
>> > > Thanks!
>> > >
>> > > Leila
>> > >
>> > >
>> > >
>> > > On Thu, Jun 25, 2015 at 4:15 PM, Mark Abraham <
>> mark.j.abraham at gmail.com>
>> > > wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > I can't tell either. Please run gmxcheck on all your input files
>> > > > to check that the simulation part, time, and step number are all
>> > > > what you think they are (and that they match across the
>> > > > simulations), and try again.
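>> > > >
>> > > > A minimal sketch of that check, assuming replica checkpoints
>> > > > numbered 0-7 (hypothetical); gmxcheck also reads .cpt files and
>> > > > reports the time stored in them:
>> > > >
>> > > >   for i in 0 1 2 3 4 5 6 7; do
>> > > >     gmxcheck -f state$i.cpt   # print the frame time held in each checkpoint
>> > > >   done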
>> > > >
>> > > > Mark
>> > > >
>> > > > On Thu, Jun 25, 2015 at 4:12 PM leila salimi <leilasalimi at gmail.com
>> >
>> > > > wrote:
>> > > >
>> > > > > Dear Mark,
>> > > > >
>> > > > > When I tried with the newly updated state.cpt files, I got this
>> > > > > error:
>> > > > >
>> > > > > Abort(1) on node 896 (rank 896 in comm 1140850688): Fatal error in
>> > > > > MPI_Allreduce: Message truncated, error stack:
>> > > > > MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE,
>> > > > > rbuf=0x7ffc783af760, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000002) failed
>> > > > > MPIR_Allreduce_impl(769).:
>> > > > > MPIR_Allreduce_intra(419):
>> > > > > MPIC_Sendrecv(467).......:
>> > > > > MPIDI_Buffer_copy(73)....: Message truncated; 64 bytes received
>> > > > > but buffer size is 32
>> > > > > Abort(1) on node 768 (rank 768 in comm 1140850688): Fatal error in
>> > > > > MPI_Allreduce: Message truncated, error stack:
>> > > > > MPI_Allreduce(912).......: MPI_Allreduce(sbuf=MPI_IN_PLACE,
>> > > > > rbuf=0x7ffdba5176a0, count=4, MPI_DOUBLE, MPI_SUM, comm=0x84000002) failed
>> > > > > MPIR_Allreduce_impl(769).:
>> > > > > MPIR_Allreduce_intra(419):
>> > > > > MPIC_Sendrecv(467).......:
>> > > > > MPIDI_Buffer_copy(73)....: Message truncated; 64 bytes received
>> > > > > but buffer size is 32
>> > > > > ERROR: 0031-300  Forcing all remote tasks to exit due to exit
>> > > > > code 1 in task 896
>> > > > > "job.err.1011016.out" 399L, 17608C
>> > > > >
>> > > > > Actually I don't know what the problem is!
>> > > > >
>> > > > > Regards,
>> > > > > Leila
>> > > > >
>> > > > >
>> > > > > On Thu, Jun 18, 2015 at 12:00 AM, leila salimi <
>> > leilasalimi at gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > I understand what you meant; I will run only a few steps for
>> > > > > > the other replicas and then continue with all the replicas
>> > > > > > together.
>> > > > > > I hope everything goes well.
>> > > > > >
>> > > > > > Thanks very much.
>> > > > > >
>> > > > > > On Wed, Jun 17, 2015 at 11:43 PM, leila salimi <
>> > > leilasalimi at gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > >> Thanks Mark for your suggestion.
>> > > > > >> Actually I don't understand the two new state6.cpt and
>> > > > > >> state7.cpt files, because the time they show is 127670.062!
>> > > > > >> That is strange, because my time step is 2 fs and I save the
>> > > > > >> output every 250 steps, i.e. every 500 fs. I would expect the
>> > > > > >> time to be something like 127670.000 or 127670.500.
>> > > > > >>
>> > > > > >> By the way, do you mean that with mdrun_mpi ... -nsteps ... I
>> > > > > >> can advance the old state.cpt files by the steps that I need?
>> > > > > >>
>> > > > > >> Regards,
>> > > > > >> Leila
>> > > > > >>
>> > > > > >> On Wed, Jun 17, 2015 at 11:22 PM, Mark Abraham <
>> > > > > mark.j.abraham at gmail.com>
>> > > > > >> wrote:
>> > > > > >>
>> > > > > >>> Hi,
>> > > > > >>>
>> > > > > >>> That's all extremely strange. Given that you aren't going to
>> > > > > >>> exchange in that short period of time, you can probably do
>> > > > > >>> some arithmetic and work out how many steps you'd need to
>> > > > > >>> advance whichever set of files is behind the other. Then
>> > > > > >>> mdrun_mpi ... -nsteps y can write a set of checkpoint files
>> > > > > >>> that will all be at the same time!
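>> > > > > >>>
>> > > > > >>> (A sketch of that arithmetic, with hypothetical numbers: a
>> > > > > >>> checkpoint time of 127670.062 ps at dt = 0.002 ps is step
>> > > > > >>> 63835031; if the other replicas are at 127670.500 ps, i.e.
>> > > > > >>> step 63835250, the lagging replica is 219 steps behind, e.g.
>> > > > > >>>
>> > > > > >>>   # advance one replica alone up to the common step
>> > > > > >>>   mdrun_mpi -s topol6.tpr -cpi state6.cpt -nsteps 63835250
>> > > > > >>>
>> > > > > >>> assuming -nsteps gives the total step count from step 0 and
>> > > > > >>> that topol6.tpr is that replica's run input.)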
>> > > > > >>>
>> > > > > >>> Mark
>> > > > > >>>
>> > > > > >>> On Wed, Jun 17, 2015 at 10:18 PM leila salimi <
>> > > leilasalimi at gmail.com
>> > > > >
>> > > > > >>> wrote:
>> > > > > >>>
>> > > > > >>> > Hi Mark,
>> > > > > >>> >
>> > > > > >>> > Thanks very much. Unfortunately both state6.cpt and
>> > > > > >>> > state6_prev.cpt, and state7.cpt and state7_prev.cpt, were
>> > > > > >>> > updated, and their times differ from those of the other
>> > > > > >>> > replicas' files (including the *_prev.cpt files)!
>> > > > > >>> >
>> > > > > >>> > I am thinking maybe I can use init-step in the .mdp file
>> > > > > >>> > and start from the time that I have, because all the .trr
>> > > > > >>> > files have the same time (I checked with gmxcheck). But I
>> > > > > >>> > am not sure that I will get correct results!
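>> > > > > >>> >
>> > > > > >>> > A minimal sketch of that idea, assuming a common time of
>> > > > > >>> > 127670.000 ps, dt = 0.002 ps, and tinit = 0 (the manual
>> > > > > >>> > gives t = tinit + dt * (init-step + i)):
>> > > > > >>> >
>> > > > > >>> >   init-step = 63835000   ; 127670.000 ps / 0.002 ps per step
>> > > > > >>> >
>> > > > > >>> > though whether this gives a correct REMD continuation is
>> > > > > >>> > exactly what I am not sure about.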
>> > > > > >>> > Actually I was confused that, given the Note I mentioned,
>> > > > > >>> > only two replicas were running, and their state files
>> > > > > >>> > changed while the others did not!
>> > > > > >>> >
>> > > > > >>> > Regards,
>> > > > > >>> > Leila

