[gmx-users] The 20 subsystems are not compatible (REMD)

Tue Nov 26 16:22:39 CET 2013

On Tue, Nov 26, 2013 at 2:53 PM, Pacho Ramos <pachoramos at gmail.com> wrote:

> Hello
>
> I am having a lot of trouble getting a REMD simulation to finish. After running
> for some time, some replicas are interrupted without writing a state file,
> which then leads to:
> The 20 subsystems are not compatible
>
> error on the next run.
>
> I have run "gmxcheck -f" with all state files and I found that the time is
> different for two of them:
> - Most replicas have:
> Last frame         -1 time 14786.000
> - But replicas 16 and 17 have:
> Last frame         -1 time 14772.900
>
> I have looked at *prev states but they also differ:
> - Most of them have:
> Last frame         -1 time 14772.880
> - But replicas 16 and 17 have:
> Last frame         -1 time 14748.300
>
> As you can see, the *prev* states of most replicas don't match the states of
> replicas 16 and 17 (14772.880 vs. 14772.900).
>

The combination of the current and _prev checkpoint files is supposed to
guarantee the existence of a set of .cpt files whose time stamp matches.
This should permit you to back up all your files, rename some files
appropriately and move on. (You can try this with your files, but the above
suggests it will not work.) This can get double-crossed if file systems do
not implement the standard flush-to-disk that they are supposed to do when
mdrun tells them to. But that should not lead to time stamps differing by
.02 ps. What GROMACS version is this? I don't recall such a bug, but if
this is with 4.5.5 or something, I would suggest you inspect the change logs
of later GROMACS versions for clues that this got fixed.
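
To make the "matching set" idea concrete, here is a minimal Python sketch that checks whether a common time stamp exists across the current and _prev checkpoints of all replicas. The times are the ones gmxcheck reported above; the function name and the "state"/"prev" labels are made up for illustration and are not part of GROMACS.

```python
from decimal import Decimal

def common_checkpoint_time(times):
    """Return (time, {replica: label}) for the latest time stamp that every
    replica has in one of its checkpoint files, or None if there is none.

    `times` maps replica index -> {"state": time, "prev": time}; the labels
    are hypothetical names for the current and _prev checkpoint files.
    """
    candidates = None
    for files in times.values():
        available = set(files.values())
        candidates = available if candidates is None else candidates & available
    if not candidates:
        return None
    best = max(candidates)
    choice = {rep: next(label for label, t in files.items() if t == best)
              for rep, files in times.items()}
    return best, choice

# Times from the gmxcheck output quoted above (20 replicas, 16 and 17 differ).
times = {rep: {"state": Decimal("14786.000"), "prev": Decimal("14772.880")}
         for rep in range(20)}
for rep in (16, 17):
    times[rep] = {"state": Decimal("14772.900"), "prev": Decimal("14748.300")}

print(common_checkpoint_time(times))  # None: no time stamp is shared by all replicas
```

With these numbers no common time exists, which is why renaming files is not expected to rescue this particular run.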

> Looking at the log files I also see two differences:
> - Most of them end with:
> Replica exchange at step 7392900 time 14785.8
> Repl 0 <-> 1  dE_term = -3.084e+00 (kT)
> dplumed =  0.000e+00 dE_term = -3.084e+00 (kT)
> Repl ex  0 x  1    2 x  3    4 x  5    6 x  7    8 x  9   10 x 11   12 x 13
>   14 x 15   16 x 17   18   19
> Repl pr   1.0       1.0       1.0       1.0       1.0       .43       1.0
>     1.0       1.0       .14
>
>
> Step 7392990: Run time exceeded 11.385 hours, will terminate the run
>            Step           Time         Lambda
>         7393000    14786.00000        0.00000
>
> Writing checkpoint, step 7393000 at Tue Nov 26 13:01:52 2013
>
> - The offending replicas (16 and 17) end with:
> Step 7393000: Run time exceeded 11.385 hours, will terminate the run
>    Energies (kJ/mol)
>            Bond          Angle    Proper Dih.  Improper Dih.GB Polarization
>     2.41638e+03    6.14715e+03    4.95017e+03    4.27469e+02   -1.12985e+04
>   Nonpolar Sol.          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)
>     3.30195e+02    2.08134e+03    3.34282e+04   -3.57626e+03   -4.28115e+04
>       Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
>    -7.90536e+03    9.75192e+03    1.84656e+03    2.77779e+05    4.65521e+02
>  Pressure (bar)
>     0.00000e+00
>

Hmm, that might explain the 0.02 ps thing. That's probably nstlist*dt,
right? mdrun is supposed to communicate inter-simulation at the next
neighbour-search stage that at least one simulation has observed -maxh and
so all simulations should write a checkpoint at (IIRC) the *next*
neighbour-search step and exit. It's conceivable that a set of delayed
processors (e.g. local network contention) belonging to only a few replicas
could have matched an MPI message from a wrong time step. Proving that such
a bug exists and/or fixing it is normally a PITA, so we would only consider
looking into it if you've observed this in 4.6.x.
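
The nstlist*dt guess can be checked against the numbers in the post with a little arithmetic. This is only a sketch: dt is inferred from the final step and time in the log, and the value of nstlist is not stated anywhere in the thread, so reading the result as nstlist is an assumption.

```python
from decimal import Decimal

# Final step and time from the log of the well-behaved replicas.
last_step = 7393000
last_time_ps = Decimal("14786.000")
dt = last_time_ps / last_step                # time step in ps

# Mismatch between replicas 16/17 (state) and the others (_prev).
offset = Decimal("14772.900") - Decimal("14772.880")
steps_apart = offset / dt                    # mismatch expressed in MD steps

# Steps at which the two groups of replicas noticed the -maxh limit.
log_step_gap = 7393000 - 7392990

print("dt (ps):", dt)
print("checkpoint offset (ps):", offset)
print("offset in steps:", steps_apart)
print("gap between -maxh messages (steps):", log_step_gap)
```

Both gaps come out at 10 steps, which fits the suggestion that the two groups of replicas acted one neighbour-search interval apart, provided nstlist is indeed 10.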

> -> Looks like they got interrupted before writing the state file, leading
> to all these problems. But I don't know how to fix this situation and
> prevent it from occurring again in the future (currently, I ask for 12 hours
> of processor time and run mdrun with -maxh 11.5... maybe I should give it more
> time and run it with -maxh 11 to let it exit ok during 1 hour :/)
>

If the queue system uses job suspension, -maxh can get double-crossed, but
that is probably not the issue here.

If my guess is right, then there's no way you can eliminate the possibility
of it occurring. Using mdrun -noappend will keep a full set of numbered
.cpt files, which will mitigate the loss in future, but you'll have to
manage concatenating your own output files old-school style. Or your job
scripts can back up the .cpt files between runs, so your maximum loss is a
single job submission.
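
As a rough illustration of the backup option, a job script could call something like the following Python sketch between runs. It assumes all replicas' .cpt files sit in one run directory and that the backup directory lives outside it; the function name, the directory names, and the run_NNNN numbering scheme are all hypothetical.

```python
import shutil
from pathlib import Path

def backup_checkpoints(run_dir, backup_root):
    """Copy every *.cpt file in run_dir into a new numbered subdirectory of
    backup_root and return (target_directory, copied_file_names)."""
    run_dir, backup_root = Path(run_dir), Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    # Number the new backup after the ones that already exist.
    index = sum(1 for p in backup_root.iterdir() if p.is_dir()) + 1
    target = backup_root / f"run_{index:04d}"
    target.mkdir()
    copied = []
    for cpt in sorted(run_dir.glob("*.cpt")):
        shutil.copy2(cpt, target / cpt.name)  # copy2 keeps modification times
        copied.append(cpt.name)
    return target, copied

# Example call from a job script, after mdrun has exited (paths are made up):
# backup_checkpoints("remd_run", "cpt_backups")
```

Keeping each job's checkpoints in a separate directory means a bad set from one submission never overwrites the last good set.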

Mark

> Thanks a lot for your help