[gmx-users] The 20 subsystems are not compatible (REMD)

Tue Nov 26 17:56:57 CET 2013

2013/11/26 Mark Abraham <mark.j.abraham at gmail.com>
>
> The combination of the current and _prev checkpoint files is supposed to
> guarantee the existence of a set of .cpt files whose time stamp matches.
> This should permit you to back up all your files, rename some files
> appropriately and move on. (You can try this with your files, but the above
> suggests it will not work.) This can get double-crossed if file systems do
> not implement the standard flush-to-disk that they are supposed to do when
> mdrun tells them to. But that should not lead to time stamps differing by
> .02 ps. What GROMACS version is this? I don't recall such a bug, but if
> this is with 4.5.5 or something, I would suggest you inspect the GROMACS
> versions change log for clues this got fixed.
>
>
Hi!

I also tried renaming the old state files for all replicas except the
offending ones (16 and 17), but that did indeed fail :(

Regarding my GROMACS version, I am running 4.6.2.

About the file systems, I don't know much about how they are set up,
because this runs on an external cluster that is administered by other
people. I think they are using NFS, but I don't know the details of its
configuration.

> Hmm, that might explain the 0.02ps thing. That's probably nstlist*dt,
> right?

Yes, I have dt = 0.002 set, so I guess we are referring to the same thing :)
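
For reference, the arithmetic behind that 0.02 ps figure can be sketched as
below. The value nstlist = 10 is an assumption: it is simply what
0.02 ps / 0.002 ps implies, not a number quoted from the .mdp file.

```python
# Simulated time between two consecutive neighbour-search steps.
dt = 0.002     # ps, the time step mentioned above
nstlist = 10   # ASSUMED: the value implied by 0.02 ps / 0.002 ps


def neighbour_search_interval(nstlist, dt):
    """Return the simulated time (ps) covered by nstlist MD steps."""
    return nstlist * dt


interval = neighbour_search_interval(nstlist, dt)
print(f"{interval:.3f} ps between neighbour searches")
```

If the real nstlist differs, the gap between the mismatched time stamps
would be a multiple of a different interval.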

> mdrun is supposed to communicate inter-simulation at the next
> neighbour-search stage that at least one simulation has observed -maxh and
> so all simulations should write a checkpoint at (IIRC) the *next*
> neighbour-search step and exit. It's conceivable that a set of delayed
> processors (e.g. local network contention) belonging to only a few replicas
> could have matched an MPI message from a wrong time step. Proving that such
> a bug exists and/or fixing it is normally a PITA, so we would only consider
> looking into it if you've observed this in 4.6.x.
>

Yes, this is 4.6.2. I know there are slightly newer versions available, but
this is the latest one compiled and ready for the cluster setup I am using
now.

>
> > -> Looks like they got interrupted before writing the state file, leading
> > to all these problems. But I don't know how to fix this situation and
> > prevent it from occurring again in the future (currently, I ask for
> > 12 hours of processor time and run mdrun with -maxh 11.5... maybe I
> > should give it more time and run it with -maxh 11, so it has 1 hour to
> > exit cleanly :/)
> >
> >
>
> If the queue system uses job suspension, -maxh can get double-crossed, but
> is probably not the issue here.
>

The queue system has job suspension, but I launch my jobs with the option
to be mailed when a job is suspended, and that doesn't appear to have
happened.

>
> If my guess is right, then there's no way you can eliminate the possibility
> of it occurring. Using mdrun -noappend will keep a full set of numbered
> .cpt files, which will mitigate the loss in future, but you'll have to
> manage concatenating your own output files old-school style. Or your job
> scripts can back up the .cpt files between runs, so your maximum loss is a
> single job submission.
>
> Mark
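
The back-up step suggested above could be sketched as follows, to be called
by the job script before each mdrun invocation. This is a minimal sketch
under assumptions: the function name, the directory layout and the "*.cpt"
pattern (meant to catch both the current and the _prev checkpoints of every
replica) are hypothetical and would need adapting to the real run directory.

```python
import shutil
import time
from pathlib import Path


def backup_checkpoints(run_dir, backup_root, pattern="*.cpt"):
    """Copy every checkpoint file in run_dir into a new time-stamped folder.

    Returns the folder that was created. The names used here are
    illustrative assumptions, not anything mdrun prescribes.
    """
    run_dir = Path(run_dir)
    dest = Path(backup_root) / time.strftime("cpt-%Y%m%d-%H%M%S")
    # exist_ok=False: refuse to overwrite an earlier back-up silently.
    dest.mkdir(parents=True, exist_ok=False)
    for cpt in sorted(run_dir.glob(pattern)):
        shutil.copy2(cpt, dest / cpt.name)  # copy2 keeps modification times
    return dest
```

Keeping one folder per submission means the worst case is losing a single
job's worth of simulation, at the cost of some disk space.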

Umm, the solution of backing up the state files periodically does look
interesting :/ Just one question about it: I guess I would then simply need
to find the state files with the same time for each replica, rename them to
XXX.cpt, and restart the simulation, right? Or should I do anything more to
make the next mdrun run use the fixed (and older) state files and continue
from them, discarding the newer steps that the trajectory files may
contain?
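
To make the "find the state files with the same time" step concrete, here
is a minimal sketch. It assumes the time stored in each checkpoint has
already been read by some other means and collected in a dictionary; the
function names, file names and input layout are hypothetical. It only
selects files and does not rename anything or touch the trajectories.

```python
def latest_common_time(times_by_replica, ndigits=3):
    """Return the latest checkpoint time (ps) available for every replica.

    times_by_replica maps a replica index to {file name: time in ps}.
    Times are rounded to ndigits decimals before comparison so that
    floating-point noise does not hide a match. Returns None if the
    replicas share no time at all.
    """
    common = None
    for files in times_by_replica.values():
        times = {round(t, ndigits) for t in files.values()}
        common = times if common is None else common & times
    return max(common) if common else None


def files_at_time(times_by_replica, t, ndigits=3):
    """Return {replica: file name} for the checkpoints whose time equals t."""
    target = round(t, ndigits)
    chosen = {}
    for replica, files in times_by_replica.items():
        for name, file_time in files.items():
            if round(file_time, ndigits) == target:
                chosen[replica] = name
    return chosen
```

Given such a dictionary, `files_at_time(d, latest_common_time(d))` lists,
per replica, the checkpoint that would have to become XXX.cpt.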

Thanks a lot