[gmx-users] Restarting a REMD simulation (error)
João Henriques
joao.henriques.32353 at gmail.com
Mon Apr 8 09:53:00 CEST 2013
Dear all,
Due to cluster wall-time limits, I had to restart two REMD simulations.
Both ran absolutely fine until they hit the wall-time limit. To restart
them, I used the following command:
mpirun -np 64 -output-filename MPIoutput $GromDir/mdrun_mpi -s H5_.tpr \
    -multi 64 -replex 1000 -deffnm H5_ -cpi -noappend
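With -multi 64, mdrun appends the replica index to each file name, which is why the output below shows H5_1.tpr and H5_1.cpt for replica 1. Before restarting, it may be worth confirming that every replica actually has both files. This is a minimal Python sketch, assuming the files sit in one directory and follow the <prefix><index>.tpr / <prefix><index>.cpt pattern seen in the log; missing_restart_files is a hypothetical helper, not part of GROMACS.

```python
from pathlib import Path


def missing_restart_files(prefix: str, n_replicas: int, directory: str = ".") -> list[str]:
    """Return the per-replica .tpr/.cpt files that do not exist.

    Assumes (hypothetically) the -multi naming seen in the log:
    <prefix>0.tpr ... <prefix>{n-1}.tpr and the matching .cpt files.
    """
    root = Path(directory)
    missing = []
    for i in range(n_replicas):
        for ext in ("tpr", "cpt"):
            name = f"{prefix}{i}.{ext}"
            if not (root / name).is_file():
                missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_restart_files("H5_", 64)
    print("all files present" if not gaps else "missing: " + ", ".join(gaps))
```

Note that this only checks that the files exist; it says nothing about whether the checkpoints were all written at the same step.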
(I'm using GROMACS 4.0.7; yes, I know it's old, but I have my own reasons
for using it.)
Here is the MPI output of an arbitrary replica (#1):
######START#######
NNODES=64, MYRANK=1, HOSTNAME=an091
NODEID=1 argc=11
Checkpoint file is from part 1, new output files will be suffixed part0002.
Reading file H5_1.tpr, VERSION 4.0.7 (single precision)
Reading checkpoint file H5_1.cpt generated: Wed Apr 3 17:13:14 2013
-------------------------------------------------------
Program mdrun_mpi, VERSION 4.0.7
Source code file: main.c, line: 116
Fatal error:
The 64 subsystems are not compatible
-------------------------------------------------------
Error on node 1, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 1 out of 64
######END#######
It is reading the correct .cpt and .tpr files, so the problem must lie
elsewhere.
Here is the tail of the corresponding log file:
######START#######
Initializing Replica Exchange
Repl There are 64 replicas:
Multi-checking the number of atoms ... OK
Multi-checking the integrator ... OK
Multi-checking init_step+nsteps ... OK
Multi-checking first exchange step: init_step/-replex ...
first exchange step: init_step/-replex is not equal for all subsystems
subsystem 0: 3062
subsystem 1: 3062
subsystem 2: 3062
subsystem 3: 3062
subsystem 4: 3062
subsystem 5: 3062
subsystem 6: 3062
subsystem 7: 3062
subsystem 8: 3062
subsystem 9: 3062
subsystem 10: 3062
subsystem 11: 3062
subsystem 12: 3062
subsystem 13: 3062
subsystem 14: 3062
subsystem 15: 3062
subsystem 16: 3062
subsystem 17: 3062
subsystem 18: 3062
subsystem 19: 3062
subsystem 20: 3062
subsystem 21: 3062
subsystem 22: 3062
subsystem 23: 3062
subsystem 24: 3062
subsystem 25: 3062
subsystem 26: 3062
subsystem 27: 3062
subsystem 28: 3062
subsystem 29: 3062
subsystem 30: 3062
subsystem 31: 3062
subsystem 32: 3062
subsystem 33: 3062
subsystem 34: 3062
subsystem 35: 3062
subsystem 36: 3062
subsystem 37: 3062
subsystem 38: 3062
subsystem 39: 3066
subsystem 40: 3062
subsystem 41: 3062
subsystem 42: 3062
subsystem 43: 3062
subsystem 44: 3062
subsystem 45: 3062
subsystem 46: 3062
subsystem 47: 3062
subsystem 48: 3062
subsystem 49: 3062
subsystem 50: 3062
subsystem 51: 3062
subsystem 52: 3062
subsystem 53: 3062
subsystem 54: 3062
subsystem 55: 3062
subsystem 56: 3062
subsystem 57: 3062
subsystem 58: 3062
subsystem 59: 3062
subsystem 60: 3062
subsystem 61: 3062
subsystem 62: 3062
subsystem 63: 3062
-------------------------------------------------------
Program mdrun_mpi, VERSION 4.0.7
Source code file: main.c, line: 116
Fatal error:
The 64 subsystems are not compatible
-------------------------------------------------------
######END#######
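In the log above, only subsystem 39 reports 3066, while the other 63 report 3062, which is easy to miss by eye across 64 lines. This is a minimal Python sketch that picks out the odd replicas, assuming the log lines have exactly the "subsystem N: V" form shown; find_outliers is a hypothetical helper, not part of GROMACS. With -replex 1000, a difference of 4 suggests that replica 39's checkpoint was written roughly 4000 steps later than the others. This is only approximate, since the log reports a rounded init_step/-replex quotient.

```python
import re
from collections import Counter

# Matches lines such as "subsystem 39: 3066" from the mdrun log tail.
LINE_RE = re.compile(r"^\s*subsystem\s+(\d+):\s+(-?\d+)\s*$")


def find_outliers(log_text: str, replex: int) -> dict[int, int]:
    """Return {replica: approximate step offset} for replicas whose value
    differs from the most common one.

    The offset is (value - majority) * replex and is only accurate to
    within about one exchange interval, because the log reports a
    rounded init_step/-replex.
    """
    values = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            values[int(m.group(1))] = int(m.group(2))
    if not values:
        return {}
    majority, _ = Counter(values.values()).most_common(1)[0]
    return {rep: (v - majority) * replex for rep, v in values.items() if v != majority}


if __name__ == "__main__":
    # Synthetic input mirroring the log above: replica 39 is the odd one out.
    sample = "\n".join(f"subsystem {i}: {3066 if i == 39 else 3062}" for i in range(64))
    print(find_outliers(sample, replex=1000))  # {39: 4000}
```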
It's clear that "init_step/-replex is not equal for all subsystems" is the
problem (subsystem 39 reports 3066, while all the others report 3062), but
does anyone know why this happens and how to fix it?
Thank you for your patience,
Best regards,
João Henriques