[gmx-developers] Checkpoints & REMD
Mark.Abraham at anu.edu.au
Wed Sep 7 10:16:36 CEST 2011
On 7/09/2011 5:29 PM, David van der Spoel wrote:
> I have been bitten by this problem before:
> [neolith1:native/REMD] % ls -l *cpt
> -rw-r--r-- 1 x_davva x_davva 635388 Sep 5 23:18 native10.cpt
> -rw-r--r-- 1 x_davva x_davva 635388 Sep 5 23:18 native10_prev.cpt
> -rw-r--r-- 1 x_davva x_davva 0 Sep 5 23:18 native11.cpt
> -rw-r--r-- 1 x_davva x_davva 0 Sep 5 23:18 native11_prev.cpt
> and now it happened again, using gmx 4.5.1 (for consistency). It seems
> like the checkpoint code is not REMD or multisim aware, and hence the
> code to check for the existence of xxx_prev.cpt is not sufficient.
The -cpi input string gets patched for multi-sim nearly first thing in
mdrun. I'm not aware of any check for the existence of the xxx_prev.cpt.
It's a backup of the previous .cpt, made when writing a new checkpoint,
and I can see no other place where GROMACS references it.
I also can't conceive of any way GROMACS could leave both of those files
at zero size. The backup is made before the new checkpoint is written,
and if the original had been of size zero then we could not have reached
the point of writing a new checkpoint at all. Reading the checkpoint
during a restart never truncates or overwrites the checkpoint file.
I don't know what happens when the set of input checkpoints is
inconsistent.
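The backup-then-write sequence described above can be sketched as follows. This is a minimal Python illustration of the pattern, not GROMACS's actual C code; the function name and temporary-file suffix are invented for the example:

```python
import os

def write_checkpoint(path, data):
    """Write a new checkpoint, keeping the old one as *_prev.cpt.

    The backup is made from the existing (complete) checkpoint just
    before the new one takes its place, so the backup step itself can
    never produce a zero-size current checkpoint: a zero-size .cpt
    means the run never got this far in the first place.
    """
    base, ext = os.path.splitext(path)
    tmp = path + ".tmp"
    # Write the new state to a temporary file first.
    with open(tmp, "wb") as f:
        f.write(data)
    # Back up the existing checkpoint, then atomically move the new
    # file into place.
    if os.path.exists(path):
        os.replace(path, base + "_prev" + ext)
    os.replace(tmp, path)
```

With this ordering, reading or restarting never touches the files; only a successful new write rotates them.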
> It seems that this problem happens due to the fact that my jobs are
> chained in the queueing system, and will restart a new job even if the
> previous job crashed. Hence the problem might be prevented by adding
> extensive checks in the script for existence of cpt files and
> consistency of those.
> Nevertheless it should be quite simple to introduce a multisim check
> in the cpt code before the previous version is erased. Looking at the
> latest (release-4-5-patches) source code this does not seem to be
> implemented.
That can't help. A checkpoint is written (and the _prev.cpt
overwritten) only after an inter-simulation signal has been received.
If a single replica dies at some point, that signal can't propagate,
and it is then guaranteed that the set of checkpoint files that do
exist can form a restart set. Only if a replica dies between the signal
and the checkpoint at the next NS step is manual intervention needed to
re-create a self-consistent set of .cpt files from the previous
checkpoint (which can be formed from the _prev.cpt for the non-dying
replicas and the current .cpt for the dying ones).
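As a guard for chained queue jobs like David's, a wrapper script could refuse to resubmit when any replica's checkpoint is missing or zero-sized, which would have caught the native11.cpt case in the listing above. A hedged Python sketch, with an invented helper name and illustrative file names:

```python
import os

def checkpoint_set_ok(names, directory="."):
    """Return True only if every named .cpt file exists and is non-empty.

    A chained queue job could call this before restarting mdrun, and
    bail out (leaving the files for manual repair from the _prev.cpt
    copies) instead of restarting from a broken set.
    """
    for name in names:
        cpt = os.path.join(directory, name)
        if not os.path.isfile(cpt) or os.path.getsize(cpt) == 0:
            return False
    return True
```

Checking sizes alone does not prove the set is mutually consistent (the simulation step stored inside each file would have to agree for that), but it is enough to stop the obvious zero-size failure mode before a new job runs.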
Berk's suggestion is a good one for helping the user identify mismatches.