[gmx-users] Restarting a REMD simulation (error)

João Henriques joao.henriques.32353 at gmail.com
Mon Apr 8 15:24:13 CEST 2013


Thank you very much. I hadn't noticed it until now because all those
numbers look so similar. Great eye for detail!
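
Since a single differing value is easy to miss by eye in a 64-line listing,
a short script can pick it out. The following is only a minimal sketch and
is not part of GROMACS: the function name find_outlier_subsystems and the
sample text are hypothetical, and it assumes the log keeps the
"subsystem N: value" line format shown in the quoted log below.

```python
import re
from collections import Counter

# Matches lines such as "> >   subsystem 39: 3066" (quote markers are ignored).
SUBSYSTEM_LINE = re.compile(r"subsystem\s+(\d+):\s*(-?\d+)")


def find_outlier_subsystems(log_text):
    """Return (majority_value, outliers) for the multi-check listing.

    outliers maps subsystem index -> value for every subsystem whose
    value differs from the most common one.
    """
    values = {int(i): int(v) for i, v in SUBSYSTEM_LINE.findall(log_text)}
    if not values:
        raise ValueError("no 'subsystem N: value' lines found")
    # The most common value is treated as the consistent one (an assumption).
    majority_value, _ = Counter(values.values()).most_common(1)[0]
    outliers = {i: v for i, v in sorted(values.items()) if v != majority_value}
    return majority_value, outliers


if __name__ == "__main__":
    # Hypothetical excerpt in the same format as the mdrun log.
    sample = """
      subsystem 38: 3062
      subsystem 39: 3066
      subsystem 40: 3062
    """
    majority, odd = find_outlier_subsystems(sample)
    print("majority value:", majority)
    for index, value in odd.items():
        print("subsystem", index, "differs:", value)
```

To use it on a real run, read the relevant log file into a string and pass
it to the function; an empty result for the outliers means all subsystems
agree.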

João


On Mon, Apr 8, 2013 at 3:17 PM, Mark Abraham <mark.j.abraham at gmail.com> wrote:

> On Apr 8, 2013 8:53 AM, "João Henriques" <joao.henriques.32353 at gmail.com>
> wrote:
> >
> > Dear all,
> >
> > Due to cluster wall-time limitations, I was forced to restart two REMD
> > simulations. They ran absolutely fine until hitting the wall-time limit.
> > To restart, I used the following command:
> >
> > mpirun -np 64 -output-filename MPIoutput $GromDir/mdrun_mpi -s H5_.tpr
> > -multi 64 -replex 1000 -deffnm H5_ -cpi -noappend
> >
> > (I'm using GMX-4.0.7, and yes, I know it's old, but I have my own
> > reasons for using it.)
> >
> > Here is a random replica (#1) MPI output:
> >
> > ######START#######
> > NNODES=64, MYRANK=1, HOSTNAME=an091
> > NODEID=1 argc=11
> > Checkpoint file is from part 1, new output files will be suffixed part0002.
> > Reading file H5_1.tpr, VERSION 4.0.7 (single precision)
> >
> > Reading checkpoint file H5_1.cpt generated: Wed Apr  3 17:13:14 2013
> >
> > -------------------------------------------------------
> > Program mdrun_mpi, VERSION 4.0.7
> > Source code file: main.c, line: 116
> >
> > Fatal error:
> > The 64 subsystems are not compatible
> >
> > -------------------------------------------------------
> >
> > Error on node 1, will try to stop all the nodes
> > Halting parallel program mdrun_mpi on CPU 1 out of 64
> > ######END#######
> >
> > It's reading from the correct cpt and tpr files, so it must be something
> > else.
> >
> > Here is a tail of the respective log file:
> >
> > ######START#######
> > Initializing Replica Exchange
> > Repl  There are 64 replicas:
> > Multi-checking the number of atoms ... OK
> > Multi-checking the integrator ... OK
> > Multi-checking init_step+nsteps ... OK
> > Multi-checking first exchange step: init_step/-replex ...
> > first exchange step: init_step/-replex is not equal for all subsystems
> >   subsystem 0: 3062
> >   subsystem 1: 3062
> >   subsystem 2: 3062
> >   subsystem 3: 3062
> >   subsystem 4: 3062
> >   subsystem 5: 3062
> >   subsystem 6: 3062
> >   subsystem 7: 3062
> >   subsystem 8: 3062
> >   subsystem 9: 3062
> >   subsystem 10: 3062
> >   subsystem 11: 3062
> >   subsystem 12: 3062
> >   subsystem 13: 3062
> >   subsystem 14: 3062
> >   subsystem 15: 3062
> >   subsystem 16: 3062
> >   subsystem 17: 3062
> >   subsystem 18: 3062
> >   subsystem 19: 3062
> >   subsystem 20: 3062
> >   subsystem 21: 3062
> >   subsystem 22: 3062
> >   subsystem 23: 3062
> >   subsystem 24: 3062
> >   subsystem 25: 3062
> >   subsystem 26: 3062
> >   subsystem 27: 3062
> >   subsystem 28: 3062
> >   subsystem 29: 3062
> >   subsystem 30: 3062
> >   subsystem 31: 3062
> >   subsystem 32: 3062
> >   subsystem 33: 3062
> >   subsystem 34: 3062
> >   subsystem 35: 3062
> >   subsystem 36: 3062
> >   subsystem 37: 3062
> >   subsystem 38: 3062
> >   subsystem 39: 3066
>
> It seems subsystem 39 got its I/O done faster, so its checkpoint is from a
> later step than the others. Its state_prev.cpt will be at 3062. Back up
> your files first. Use gmxcheck to see what's in each file, then rename as
> needed so that your set of files is consistent.
>
> Mark
>
> >   subsystem 40: 3062
> >   subsystem 41: 3062
> >   subsystem 42: 3062
> >   subsystem 43: 3062
> >   subsystem 44: 3062
> >   subsystem 45: 3062
> >   subsystem 46: 3062
> >   subsystem 47: 3062
> >   subsystem 48: 3062
> >   subsystem 49: 3062
> >   subsystem 50: 3062
> >   subsystem 51: 3062
> >   subsystem 52: 3062
> >   subsystem 53: 3062
> >   subsystem 54: 3062
> >   subsystem 55: 3062
> >   subsystem 56: 3062
> >   subsystem 57: 3062
> >   subsystem 58: 3062
> >   subsystem 59: 3062
> >   subsystem 60: 3062
> >   subsystem 61: 3062
> >   subsystem 62: 3062
> >   subsystem 63: 3062
> >
> > -------------------------------------------------------
> > Program mdrun_mpi, VERSION 4.0.7
> > Source code file: main.c, line: 116
> >
> > Fatal error:
> > The 64 subsystems are not compatible
> >
> > -------------------------------------------------------
> > ######END#######
> >
> > It's clear that "init_step/-replex is not equal for all subsystems" is
> > the problem, but does anyone know why this is happening and how to solve it?
> >
> > Thank you for your patience,
> > Best regards,
> >
> > João Henriques
> > --
> > gmx-users mailing list    gmx-users at gromacs.org
> > http://lists.gromacs.org/mailman/listinfo/gmx-users
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
> > * Please don't post (un)subscribe requests to the list. Use the
> > www interface or send it to gmx-users-request at gromacs.org.
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>



-- 
João Henriques


