[gmx-users] Restarting a REMD simulation (error)

Mark Abraham mark.j.abraham at gmail.com
Mon Apr 8 15:17:41 CEST 2013


On Apr 8, 2013 8:53 AM, "João Henriques" <joao.henriques.32353 at gmail.com>
wrote:
>
> Dear all,
>
> Due to cluster wall-time limitations, I was forced to restart two REMD
> simulations. They ran absolutely fine until hitting the wall-time. To
> restart I used the following command:
>
> mpirun -np 64 -output-filename MPIoutput $GromDir/mdrun_mpi -s H5_.tpr
> -multi 64 -replex 1000 -deffnm H5_ -cpi -noappend
>
> (I'm using GMX-4.0.7 and yes I know it's old but I have my own reasons for
> using it.)
>
> Here is a random replica (#1) MPI output:
>
> ######START#######
> NNODES=64, MYRANK=1, HOSTNAME=an091
> NODEID=1 argc=11
> Checkpoint file is from part 1, new output files will be suffixed part0002.
> Reading file H5_1.tpr, VERSION 4.0.7 (single precision)
>
> Reading checkpoint file H5_1.cpt generated: Wed Apr  3 17:13:14 2013
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.0.7
> Source code file: main.c, line: 116
>
> Fatal error:
> The 64 subsystems are not compatible
>
> -------------------------------------------------------
>
> Error on node 1, will try to stop all the nodes
> Halting parallel program mdrun_mpi on CPU 1 out of 64
> ######END#######
>
> It's reading from the correct cpt and tpr files, so it must be something
> else.
>
> Here is a tail of the respective log file:
>
> ######START#######
> Initializing Replica Exchange
> Repl  There are 64 replicas:
> Multi-checking the number of atoms ... OK
> Multi-checking the integrator ... OK
> Multi-checking init_step+nsteps ... OK
> Multi-checking first exchange step: init_step/-replex ...
> first exchange step: init_step/-replex is not equal for all subsystems
>   subsystem 0: 3062
>   subsystem 1: 3062
>   subsystem 2: 3062
>   subsystem 3: 3062
>   subsystem 4: 3062
>   subsystem 5: 3062
>   subsystem 6: 3062
>   subsystem 7: 3062
>   subsystem 8: 3062
>   subsystem 9: 3062
>   subsystem 10: 3062
>   subsystem 11: 3062
>   subsystem 12: 3062
>   subsystem 13: 3062
>   subsystem 14: 3062
>   subsystem 15: 3062
>   subsystem 16: 3062
>   subsystem 17: 3062
>   subsystem 18: 3062
>   subsystem 19: 3062
>   subsystem 20: 3062
>   subsystem 21: 3062
>   subsystem 22: 3062
>   subsystem 23: 3062
>   subsystem 24: 3062
>   subsystem 25: 3062
>   subsystem 26: 3062
>   subsystem 27: 3062
>   subsystem 28: 3062
>   subsystem 29: 3062
>   subsystem 30: 3062
>   subsystem 31: 3062
>   subsystem 32: 3062
>   subsystem 33: 3062
>   subsystem 34: 3062
>   subsystem 35: 3062
>   subsystem 36: 3062
>   subsystem 37: 3062
>   subsystem 38: 3062
>   subsystem 39: 3066

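It seems replica 39 got its checkpoint I/O done sooner than the others
before the job was killed, so its latest checkpoint is from a later step
than everyone else's (3066 versus 3062). Its state_prev.cpt should be at
3062. Back up your files first, use gmxcheck to see what is in each file,
then rename as needed so that your whole set of files is consistent.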

Mark
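
The check-and-rename advice above could be scripted roughly as follows. This is only a minimal sketch and not part of GROMACS. The helper names find_out_of_step and roll_back_checkpoint are hypothetical, and the file naming (<prefix><replica>.cpt and <prefix><replica>_prev.cpt, e.g. H5_39.cpt and H5_39_prev.cpt) is an assumption based on the -deffnm H5_ and -multi options in the command; check it against the files actually on disk. The first function reads the "subsystem N: value" lines from the log and reports replicas that disagree with the majority. The second backs up both checkpoints and then copies the previous one over the current one.

```python
import re
import shutil
from collections import Counter
from pathlib import Path

# Matches log lines such as ">   subsystem 39: 3066".
SUBSYSTEM_LINE = re.compile(r"subsystem\s+(\d+):\s+(-?\d+)")


def find_out_of_step(log_text):
    """Return (majority_value, {replica: value}) for replicas that differ."""
    values = {int(m.group(1)): int(m.group(2))
              for m in SUBSYSTEM_LINE.finditer(log_text)}
    if not values:
        return None, {}
    majority, _ = Counter(values.values()).most_common(1)[0]
    outliers = {r: v for r, v in values.items() if v != majority}
    return majority, outliers


def roll_back_checkpoint(workdir, prefix, replica, backup_dir):
    """Back up a replica's checkpoints, then make the previous one current.

    Assumes (hypothetically) the names <prefix><replica>.cpt and
    <prefix><replica>_prev.cpt; adjust to match the real files.
    """
    workdir, backup_dir = Path(workdir), Path(backup_dir)
    current = workdir / f"{prefix}{replica}.cpt"
    previous = workdir / f"{prefix}{replica}_prev.cpt"
    if not previous.exists():
        raise FileNotFoundError(previous)
    # Always keep untouched copies before overwriting anything.
    backup_dir.mkdir(parents=True, exist_ok=True)
    for f in (current, previous):
        if f.exists():
            shutil.copy2(f, backup_dir / f.name)
    shutil.copy2(previous, current)
    return current
```

For the log in this thread, find_out_of_step would be expected to report 3062 as the majority value and replica 39 (at 3066) as the only outlier. Before restarting, it is still worth confirming with gmxcheck that the restored checkpoint really is at the same step as the others, and keeping in mind that other output files from that replica may extend past the restored checkpoint.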

>   subsystem 40: 3062
>   subsystem 41: 3062
>   subsystem 42: 3062
>   subsystem 43: 3062
>   subsystem 44: 3062
>   subsystem 45: 3062
>   subsystem 46: 3062
>   subsystem 47: 3062
>   subsystem 48: 3062
>   subsystem 49: 3062
>   subsystem 50: 3062
>   subsystem 51: 3062
>   subsystem 52: 3062
>   subsystem 53: 3062
>   subsystem 54: 3062
>   subsystem 55: 3062
>   subsystem 56: 3062
>   subsystem 57: 3062
>   subsystem 58: 3062
>   subsystem 59: 3062
>   subsystem 60: 3062
>   subsystem 61: 3062
>   subsystem 62: 3062
>   subsystem 63: 3062
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.0.7
> Source code file: main.c, line: 116
>
> Fatal error:
> The 64 subsystems are not compatible
>
> -------------------------------------------------------
> ######END#######
>
> It's clear that "init_step/-replex is not equal for all subsystems" is the
> problem, but does anyone know why this is happening and how to solve it?
>
> Thank you for your patience,
> Best regards,
>
> João Henriques


