[gmx-developers] Checkpoints & REMD
Berk Hess
hess at cbr.su.se
Wed Sep 7 09:48:41 CEST 2011
Hi,
The zero-size files come from a general checkpointing bug, or more
precisely a bug in opening files in append mode, which has been fixed
for 4.5.5. An earlier fix went into an intermediate version, but in the
current release-4-5-patches it should be completely fixed.
Or are you referring to the problem that mdrun reads checkpoints for
some, but not all, replicas and does not realize this?
That should indeed be fixed.
Berk
On 09/07/2011 09:29 AM, David van der Spoel wrote:
> Hi,
>
> I have been bitten by this problem before:
>
> [neolith1:native/REMD] % ls -l *cpt
> -rw-r--r-- 1 x_davva x_davva 635388 Sep 5 23:18 native10.cpt
> -rw-r--r-- 1 x_davva x_davva 635388 Sep 5 23:18 native10_prev.cpt
> -rw-r--r-- 1 x_davva x_davva 0 Sep 5 23:18 native11.cpt
> -rw-r--r-- 1 x_davva x_davva 0 Sep 5 23:18 native11_prev.cpt
>
> and now it happened again, using gmx 4.5.1 (for consistency). It seems
> like the checkpoint code is not REMD or multisim aware, and hence the
> code to check for the existence of xxx_prev.cpt is not sufficient.
>
> It seems that this problem happens due to the fact that my jobs are
> chained in the queueing system, and will restart a new job even if the
> previous job crashed. Hence the problem might be prevented by adding
> extensive checks in the script for existence of cpt files and
> consistency of those.
>
> Nevertheless it should be quite simple to introduce a multisim check
> in the cpt code before the previous version is erased. Looking at the
> latest (release-4-5-patches) source code this does not seem to be
> present.
>
> Cheers,
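
For the script-side check mentioned in the quoted message (testing that
the cpt files exist and are consistent before a chained job restarts), a
minimal sketch could look like the following. It is only an illustration
and not part of GROMACS: the function name check_replica_checkpoints and
the three return labels are made up here, and "consistent" is assumed to
mean only that every replica has a non-empty checkpoint file.

```python
import os


def check_replica_checkpoints(paths):
    """Classify a set of per-replica checkpoint files (hypothetical helper).

    Returns "all" if every file exists and is non-empty, "none" if no
    file exists at all, and "inconsistent" for anything in between
    (for example a zero-size native11.cpt next to a valid native10.cpt).
    """
    if not paths:
        return "none"
    # A checkpoint counts as usable only if it is a regular, non-empty file.
    usable = [os.path.isfile(p) and os.path.getsize(p) > 0 for p in paths]
    present = [os.path.exists(p) for p in paths]
    if all(usable):
        return "all"
    if not any(present):
        return "none"
    return "inconsistent"
```

A job script could call this with the list of replica checkpoints and
refuse to start mdrun when the answer is "inconsistent", instead of
letting the next job in the chain continue from a mixed state.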