[gmx-developers] Checkpoints & REMD

Wed Sep 7 12:12:12 CEST 2011

On 2011-09-07 09:56, Berk Hess wrote:
> Hi,
>
> Could you try the fix below.
>
> Berk
>
>
> diff --git a/src/kernel/mdrun.c b/src/kernel/mdrun.c
> index 8878331..7b1b396 100644
> --- a/src/kernel/mdrun.c
> +++ b/src/kernel/mdrun.c
> @@ -595,6 +595,11 @@ int main(int argc,char *argv[])
>          {
>              sim_part = sim_part_fn + 1;
>          }
> +
> +        if (MULTISIM(cr))
> +        {
> +            check_multi_int(stdout,cr->ms,sim_part,"simulation part");
> +        }
>      }
>      else
>      {
>
I have compiled it and queued my job.

What I think happens is this:
- mdrun starts in multisim mode
- checkpoints are copied to checkpoint_prev and new checkpoint files are opened
- One simulation crashes and leaves an empty checkpoint file (prev still 
being OK)
- Job is restarted automatically by the queueing system
- Now the empty checkpoint file is copied to the _prev file and then a 
new empty file is created.
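
The sequence above suggests a guard on the rotation step: only move the current checkpoint over the _prev file when the current one actually has content. A minimal sketch of that idea, assuming POSIX stat() and rename(); the names cpt_is_usable and rotate_checkpoint are made up for illustration and are not GROMACS functions, and a non-zero size is only a crude stand-in for a real validity check:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if path exists and has a non-zero size, 0 otherwise.
 * Size alone does not prove the checkpoint is valid; it only
 * catches the empty-file case described above. */
int cpt_is_usable(const char *path)
{
    struct stat st;

    assert(path != NULL);
    return stat(path, &st) == 0 && st.st_size > 0;
}

/* Hypothetical helper: move cpt over prev only if cpt is usable.
 * Returns 1 if rotated, 0 if cpt was missing or empty (prev is
 * left untouched), -1 if rename() failed. */
int rotate_checkpoint(const char *cpt, const char *prev)
{
    if (!cpt_is_usable(cpt))
    {
        return 0; /* keep prev as the last known good state */
    }
    return rename(cpt, prev) == 0 ? 1 : -1;
}
```

With a guard like this, a restart that finds an empty native11.cpt would leave native11_prev.cpt alone instead of replacing it with the empty file. In a multisim run the result would still have to be compared across replicas before any of them continues.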

In non-multisim mode this is checked for, I think. I don't see offhand 
how the above patch would fix this, but I am a bit out of touch.
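
As far as can be read from the diff, the added call compares sim_part across the simulations, which would catch the case where some replicas read a checkpoint and others did not, rather than the file rotation itself. A minimal sketch of that kind of comparison, assuming the per-replica values have already been gathered into one array; first_mismatch is a made-up name and this is not the GROMACS routine:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns -1 when all n values agree,
 * otherwise the index of the first entry that differs from
 * values[0]. */
int first_mismatch(const int *values, size_t n)
{
    size_t i;

    assert(values != NULL || n == 0);
    for (i = 1; i < n; i++)
    {
        if (values[i] != values[0])
        {
            return (int)i;
        }
    }
    return -1;
}
```

For example, sim_part values {3, 3, 1, 3} give 2, pointing at the replica that started over while the others continued.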

Also, this problem is not reproducible in the sense that it depends on 
hitting a full disk or a bad node or so. So I'm not sure that my test 
will yield any results :(. Thanks for looking into it anyway.

>
> On 09/07/2011 09:48 AM, Berk Hess wrote:
>> Hi,
>>
>> The 0 size files are a general checkpointing, or better: file append
>> mode opening, bug,
>> which has been fixed for 4.5.5. There was another fix in an
>> intermediate version,
>> but in the current release-4-5-patches it should be completely fixed.
>>
>> Or are you referring to the problem that mdrun reads checkpoints for
>> some, but not all
>> replicas and does not realize this?
>> That should indeed be fixed.
>>
>> Berk
>>
>> On 09/07/2011 09:29 AM, David van der Spoel wrote:
>>> Hi,
>>>
>>> I have been bitten by this problem before:
>>>
>>> [neolith1:native/REMD] % ls -l *cpt
>>> -rw-r--r-- 1 x_davva x_davva 635388 Sep 5 23:18 native10.cpt
>>> -rw-r--r-- 1 x_davva x_davva 635388 Sep 5 23:18 native10_prev.cpt
>>> -rw-r--r-- 1 x_davva x_davva 0 Sep 5 23:18 native11.cpt
>>> -rw-r--r-- 1 x_davva x_davva 0 Sep 5 23:18 native11_prev.cpt
>>>
>>> and now it happened again, using gmx 4.5.1 (for consistency). It
>>> seems like the checkpoint code is not REMD or multisim aware, and
>>> hence the code to check for the existence of xxx_prev.cpt is not
>>> sufficient.
>>>
>>> It seems that this problem happens due to the fact that my jobs are
>>> chained in the queueing system, and will restart a new job even if
>>> the previous job crashed. Hence the problem might be prevented by
>>> adding extensive checks in the script for existence of cpt files and
>>> consistency of those.
>>>
>>> Nevertheless it should be quite simple to introduce a multisim check
>>> in the cpt code before the previous version is erased. Looking at the
>>> latest (release-4-5-patches) source code this does not seem to be
>>> present.
>>>
>>> Cheers,
>>
>

-- 
David van der Spoel, Ph.D., Professor of Biology
Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone: +46184714205.
spoel at xray.bmc.uu.se    http://folding.bmc.uu.se