[gmx-users] Restarting a REMD simulation (error)

João Henriques joao.henriques.32353 at gmail.com
Mon Apr 8 09:53:00 CEST 2013


Dear all,

Due to cluster wall-time limitations, I was forced to restart two REMD
simulations. They ran absolutely fine until hitting the wall-time limit. To
restart, I used the following command:

mpirun -np 64 -output-filename MPIoutput $GromDir/mdrun_mpi -s H5_.tpr \
  -multi 64 -replex 1000 -deffnm H5_ -cpi -noappend

(I'm using GROMACS 4.0.7. Yes, I know it's old, but I have my own reasons
for using it.)

Here is the MPI output of an arbitrarily chosen replica (#1):

######START#######
NNODES=64, MYRANK=1, HOSTNAME=an091
NODEID=1 argc=11
Checkpoint file is from part 1, new output files will be suffixed part0002.
Reading file H5_1.tpr, VERSION 4.0.7 (single precision)

Reading checkpoint file H5_1.cpt generated: Wed Apr  3 17:13:14 2013

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.0.7
Source code file: main.c, line: 116

Fatal error:
The 64 subsystems are not compatible

-------------------------------------------------------

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 1 out of 64
######END#######

It's reading from the correct cpt and tpr files, so the problem must lie
elsewhere.

Here is the tail of the corresponding log file:

######START#######
Initializing Replica Exchange
Repl  There are 64 replicas:
Multi-checking the number of atoms ... OK
Multi-checking the integrator ... OK
Multi-checking init_step+nsteps ... OK
Multi-checking first exchange step: init_step/-replex ...
first exchange step: init_step/-replex is not equal for all subsystems
  subsystem 0: 3062
  subsystem 1: 3062
  subsystem 2: 3062
  subsystem 3: 3062
  subsystem 4: 3062
  subsystem 5: 3062
  subsystem 6: 3062
  subsystem 7: 3062
  subsystem 8: 3062
  subsystem 9: 3062
  subsystem 10: 3062
  subsystem 11: 3062
  subsystem 12: 3062
  subsystem 13: 3062
  subsystem 14: 3062
  subsystem 15: 3062
  subsystem 16: 3062
  subsystem 17: 3062
  subsystem 18: 3062
  subsystem 19: 3062
  subsystem 20: 3062
  subsystem 21: 3062
  subsystem 22: 3062
  subsystem 23: 3062
  subsystem 24: 3062
  subsystem 25: 3062
  subsystem 26: 3062
  subsystem 27: 3062
  subsystem 28: 3062
  subsystem 29: 3062
  subsystem 30: 3062
  subsystem 31: 3062
  subsystem 32: 3062
  subsystem 33: 3062
  subsystem 34: 3062
  subsystem 35: 3062
  subsystem 36: 3062
  subsystem 37: 3062
  subsystem 38: 3062
  subsystem 39: 3066
  subsystem 40: 3062
  subsystem 41: 3062
  subsystem 42: 3062
  subsystem 43: 3062
  subsystem 44: 3062
  subsystem 45: 3062
  subsystem 46: 3062
  subsystem 47: 3062
  subsystem 48: 3062
  subsystem 49: 3062
  subsystem 50: 3062
  subsystem 51: 3062
  subsystem 52: 3062
  subsystem 53: 3062
  subsystem 54: 3062
  subsystem 55: 3062
  subsystem 56: 3062
  subsystem 57: 3062
  subsystem 58: 3062
  subsystem 59: 3062
  subsystem 60: 3062
  subsystem 61: 3062
  subsystem 62: 3062
  subsystem 63: 3062

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.0.7
Source code file: main.c, line: 116

Fatal error:
The 64 subsystems are not compatible

-------------------------------------------------------
######END#######

It's clear that "init_step/-replex is not equal for all subsystems" is the
problem: subsystem 39 reports 3066, while every other subsystem reports 3062.
Does anyone know why this is happening and how to solve it?
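
With 64 replicas it is easy to miss the one value that differs in a listing
like the one above. The following is a minimal Python sketch, not part of
GROMACS, that scans the log text for the "subsystem N: value" lines and
reports which subsystems disagree with the majority. The function name
find_mismatched_subsystems and the sample excerpt are my own illustration;
the only assumption about the log is the line format shown above.

```python
import re
from collections import Counter

# Matches log lines such as "  subsystem 39: 3066" (format taken from the log above).
SUBSYSTEM_LINE = re.compile(r"^\s*subsystem\s+(\d+):\s+(-?\d+)\s*$")


def find_mismatched_subsystems(log_text):
    """Return (majority_value, {subsystem_index: value}) for the outliers.

    Hypothetical helper: it only parses text and knows nothing about mdrun.
    Returns (None, {}) when no "subsystem N: value" lines are found.
    """
    values = {}
    for line in log_text.splitlines():
        match = SUBSYSTEM_LINE.match(line)
        if match:
            values[int(match.group(1))] = int(match.group(2))
    if not values:
        return None, {}
    # The most common value is treated as the expected one.
    majority, _count = Counter(values.values()).most_common(1)[0]
    outliers = {idx: val for idx, val in values.items() if val != majority}
    return majority, outliers


# Short excerpt of the log above, used only for illustration.
SAMPLE_LOG = """\
first exchange step: init_step/-replex is not equal for all subsystems
  subsystem 38: 3062
  subsystem 39: 3066
  subsystem 40: 3062
"""

if __name__ == "__main__":
    majority, outliers = find_mismatched_subsystems(SAMPLE_LOG)
    print("majority value:", majority)
    for idx, val in sorted(outliers.items()):
        print("subsystem %d differs: %d" % (idx, val))
```

Passing the full contents of the log file instead of SAMPLE_LOG should single
out subsystem 39. If the reported value scales with -replex (an assumption
about how mdrun computes it, which I have not checked in the source), the
difference of 4 would correspond to roughly 4 x 1000 = 4000 steps between the
checkpoint of replica 39 and the others.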

Thank you for your patience,
Best regards,

João Henriques


