[gmx-developers] Checkpoints & REMD

Mark Abraham Mark.Abraham at anu.edu.au
Wed Sep 7 22:48:44 CEST 2011


On 7/09/2011 8:12 PM, David van der Spoel wrote:
> On 2011-09-07 09:56, Berk Hess wrote:
>> Hi,
>>
>> Could you try the fix below.
>>
>> Berk
>>
>>
>> diff --git a/src/kernel/mdrun.c b/src/kernel/mdrun.c
>> index 8878331..7b1b396 100644
>> --- a/src/kernel/mdrun.c
>> +++ b/src/kernel/mdrun.c
>> @@ -595,6 +595,11 @@ int main(int argc,char *argv[])
>>          {
>>              sim_part = sim_part_fn + 1;
>>          }
>> +
>> +        if (MULTISIM(cr))
>> +        {
>> +            check_multi_int(stdout,cr->ms,sim_part,"simulation part");
>> +        }
>>      }
>>      else
>>      {
>>
> I have compiled it and queued my job.
>
> What I think happens is this:
> - mdrun starts in multisim mode
> - checkpoints are copied to checkpoint_prev and new checkpoint files 
> are opened

That doesn't happen during a restart. It happens while writing a new 
checkpoint in write_checkpoint(). The .cpt file is only ever opened for 
writing at that point, never for appending, so the appending issue is 
not relevant here.
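
To make the order of operations concrete, here is a minimal sketch of that rotate-then-write sequence. It is not the GROMACS write_checkpoint(); the function name write_checkpoint_sketch and its arguments are hypothetical, and it uses only standard C I/O.

```c
#include <stdio.h>

/* Hypothetical illustration only, not the real write_checkpoint().
 * Keeps the existing checkpoint as the _prev file, then writes a
 * fresh checkpoint. Returns 0 on success, -1 on any failure. */
int write_checkpoint_sketch(const char *cpt, const char *prev,
                            const void *data, size_t len)
{
    FILE *fp;

    /* If an old checkpoint exists, keep it as the _prev file. */
    fp = fopen(cpt, "rb");
    if (fp != NULL)
    {
        fclose(fp);
        remove(prev); /* rename() does not overwrite on every platform */
        if (rename(cpt, prev) != 0)
        {
            return -1;
        }
    }

    /* The new file is opened for writing only, never for appending. */
    fp = fopen(cpt, "wb");
    if (fp == NULL)
    {
        return -1;
    }
    if (fwrite(data, 1, len, fp) != len)
    {
        fclose(fp);
        return -1;
    }
    /* fclose() can report a failed flush, e.g. on a full disk. */
    return (fclose(fp) == 0) ? 0 : -1;
}
```

Note that nothing in this sequence runs at restart time; it only runs once the simulation has got far enough to write a checkpoint.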

> - One simulation crashes and leaves an empty checkpoint file (prev 
> still being OK)

That can happen, but only if every replica has started correctly and 
later communicated that it is time to checkpoint, and a replica then 
crashes before all checkpoints are written. Even then, a set of 
self-consistent data must still exist.

> - Job is restarted automatically using the queueing system
> - Now the empty checkpoint file is copied to the _prev file and then a 
> new empty file is created.

Not unless your scripts are doing that :) If the file was empty on the 
restart, mdrun can't copy it because it can't get to the point of 
writing a new checkpoint.
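
A check of this kind could also be made explicit before a chained job is resubmitted. The following is only a sketch under my own assumptions: checkpoint_size and first_bad_checkpoint are hypothetical helpers, not GROMACS functions, and "usable" here means nothing more than "exists and is not empty".

```c
#include <stdio.h>

/* Hypothetical helper: size of a file in bytes, or -1 if it cannot
 * be opened or measured. */
long checkpoint_size(const char *path)
{
    FILE *fp = fopen(path, "rb");
    long  size;

    if (fp == NULL)
    {
        return -1;
    }
    if (fseek(fp, 0L, SEEK_END) != 0)
    {
        fclose(fp);
        return -1;
    }
    size = ftell(fp);
    fclose(fp);
    return size;
}

/* Hypothetical helper: index of the first replica whose checkpoint
 * is missing or empty, or -1 if every replica has a non-empty file. */
int first_bad_checkpoint(const char *const paths[], int n)
{
    int i;

    for (i = 0; i < n; i++)
    {
        if (checkpoint_size(paths[i]) <= 0)
        {
            return i;
        }
    }
    return -1;
}
```

A non-negative return value would be a reason to stop the chain rather than restart. It says nothing about whether the non-empty files are consistent with each other.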

mdrun could copy an empty file if file system buffering meant that the 
old checkpoint had still not been flushed to disk, e.g. because the 
filesystem was full one checkpoint interval (-cpt minutes) earlier. 
Other than computing a checksum before writing the old .cpt and checking 
it again later, before writing the _prev file, I can't see a way around 
this possibility.
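
A rough sketch of that checksum idea, to show what I mean. Nothing like this exists in mdrun as far as this thread goes: file_adler32 and rotate_if_intact are hypothetical names, Adler-32 is just a convenient stand-in for whatever checksum one would really choose, and the expected value is assumed to have been kept in memory from when the checkpoint was written.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: Adler-32 checksum of a file's contents.
 * Returns 0 on success, -1 if the file cannot be read. */
int file_adler32(const char *path, uint32_t *sum_out)
{
    FILE         *fp = fopen(path, "rb");
    unsigned char buf[4096];
    size_t        n, i;
    uint32_t      a = 1, b = 0;

    if (fp == NULL)
    {
        return -1;
    }
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
    {
        for (i = 0; i < n; i++)
        {
            a = (a + buf[i]) % 65521u;
            b = (b + a) % 65521u;
        }
    }
    if (ferror(fp))
    {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    *sum_out = (b << 16) | a;
    return 0;
}

/* Hypothetical helper: move cpt to prev only if what is on disk
 * still matches the checksum recorded when cpt was written.
 * Returns 0 if rotated, -1 if the file is unreadable or differs. */
int rotate_if_intact(const char *cpt, const char *prev, uint32_t expected)
{
    uint32_t now;

    if (file_adler32(cpt, &now) != 0 || now != expected)
    {
        return -1; /* leave the existing _prev file untouched */
    }
    remove(prev); /* rename() does not overwrite on every platform */
    return (rename(cpt, prev) == 0) ? 0 : -1;
}
```

With this, an old checkpoint that turned out empty or truncated would never replace a good _prev file. The cost is re-reading the old .cpt at every checkpoint.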

Mark

>
> In non-multisim mode this is checked for, I think... Offhand I don't 
> see how the above patch would fix this, but I am a bit out of touch.
>
> Also, this problem is not reproducible in the sense that it depends on 
> hitting a full disk or a bad node or so. So I'm not sure that my test 
> will yield any results :(. Thanks for looking into it anyway.
>
>>
>> On 09/07/2011 09:48 AM, Berk Hess wrote:
>>> Hi,
>>>
>>> The 0-size files are caused by a general checkpointing bug (or, more
>>> precisely, a bug in opening files in append mode),
>>> which has been fixed for 4.5.5. There was another fix in an
>>> intermediate version,
>>> but in the current release-4-5-patches it should be completely fixed.
>>>
>>> Or are you referring to the problem that mdrun reads checkpoints for
>>> some, but not all
>>> replicas and does not realize this?
>>> That should indeed be fixed.
>>>
>>> Berk
>>>
>>> On 09/07/2011 09:29 AM, David van der Spoel wrote:
>>>> Hi,
>>>>
>>>> I have been bitten by this problem before:
>>>>
>>>> [neolith1:native/REMD] % ls -l *cpt
>>>> -rw-r--r-- 1 x_davva x_davva 635388 Sep 5 23:18 native10.cpt
>>>> -rw-r--r-- 1 x_davva x_davva 635388 Sep 5 23:18 native10_prev.cpt
>>>> -rw-r--r-- 1 x_davva x_davva 0 Sep 5 23:18 native11.cpt
>>>> -rw-r--r-- 1 x_davva x_davva 0 Sep 5 23:18 native11_prev.cpt
>>>>
>>>> and now it happened again, using gmx 4.5.1 (for consistency). It
>>>> seems like the checkpoint code is not REMD or multisim aware, and
>>>> hence the code to check for the existence of xxx_prev.cpt is not
>>>> sufficient.
>>>>
>>>> It seems that this problem happens because my jobs are
>>>> chained in the queueing system, which will start a new job even if
>>>> the previous job crashed. Hence the problem might be prevented by
>>>> adding extensive checks to the script for the existence and
>>>> consistency of the cpt files.
>>>>
>>>> Nevertheless it should be quite simple to introduce a multisim check
>>>> in the cpt code before the previous version is erased. Looking at the
>>>> latest (release-4-5-patches) source code this does not seem to be
>>>> present.
>>>>
>>>> Cheers,
>>>
>>
>
>



