[gmx-developers] Checkpoints & REMD

Wed Sep 7 09:56:51 CEST 2011

Hi,

Could you try the fix below?

Berk

diff --git a/src/kernel/mdrun.c b/src/kernel/mdrun.c
index 8878331..7b1b396 100644
--- a/src/kernel/mdrun.c
+++ b/src/kernel/mdrun.c
@@ -595,6 +595,11 @@ int main(int argc,char *argv[])
        {
            sim_part = sim_part_fn + 1;
        }
+
+      if (MULTISIM(cr))
+      {
+          check_multi_int(stdout,cr->ms,sim_part,"simulation part");
+      }
    }
    else
    {
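
For context, the added call appears to compare sim_part across all
simulations of a multi-simulation, so that a run in which some replicas
continue from a checkpoint and others start fresh is caught at startup.
The following is only a minimal standalone sketch of that idea, not the
GROMACS implementation: the function name is hypothetical, and the
values are assumed to have already been gathered into one array (in
mdrun this would need communication between the simulations).

```c
#include <stddef.h>

/* Hypothetical stand-in for a cross-replica comparison.
 * vals[i] is the simulation part that replica i intends to run.
 * Returns the index of the first replica whose value differs from
 * replica 0, or -1 if all replicas agree (or nrep is 0 or 1). */
int first_mismatching_replica(const int *vals, size_t nrep)
{
    size_t i;

    for (i = 1; i < nrep; i++)
    {
        if (vals[i] != vals[0])
        {
            return (int)i;
        }
    }

    return -1;
}
```

A caller would stop with an error whenever the result is not -1, rather
than letting the replicas continue from different simulation parts.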


On 09/07/2011 09:48 AM, Berk Hess wrote:
> Hi,
>
> The zero-size files are caused by a general checkpointing bug (more
> precisely, a bug in opening files in append mode), which has been
> fixed for 4.5.5. There was an earlier fix in an intermediate version,
> but in the current release-4-5-patches it should be completely fixed.
>
> Or are you referring to the problem that mdrun reads checkpoints for 
> some, but not all
> replicas and does not realize this?
> That should indeed be fixed.
>
> Berk
>
> On 09/07/2011 09:29 AM, David van der Spoel wrote:
>> Hi,
>>
>> I have been bitten by this problem before:
>>
>> [neolith1:native/REMD] % ls -l *cpt
>> -rw-r--r-- 1 x_davva x_davva 635388 Sep  5 23:18 native10.cpt
>> -rw-r--r-- 1 x_davva x_davva 635388 Sep  5 23:18 native10_prev.cpt
>> -rw-r--r-- 1 x_davva x_davva      0 Sep  5 23:18 native11.cpt
>> -rw-r--r-- 1 x_davva x_davva      0 Sep  5 23:18 native11_prev.cpt
>>
>> and now it happened again, using gmx 4.5.1 (for consistency). It 
>> seems like the checkpoint code is not REMD or multisim aware, and 
>> hence the code to check for the existence of xxx_prev.cpt is not 
>> sufficient.
>>
>> The problem seems to arise because my jobs are chained in the 
>> queueing system, so a new job starts even if the previous job 
>> crashed. It could therefore be prevented by adding extensive checks 
>> to the job script for the existence and consistency of the cpt 
>> files.
>>
>> Nevertheless it should be quite simple to introduce a multisim check 
>> in the cpt code before the previous version is erased. Looking at the 
>> latest (release-4-5-patches) source code this does not seem to be 
>> present.
>>
>> Cheers,
>
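
The script-side check suggested above (existence and consistency of the
cpt files before a chained job restarts) could be sketched as follows.
This is an assumption-laden illustration, not part of GROMACS: the
function names are hypothetical, the files are assumed to be named
<prefix><number>.cpt as in the listing above, and "consistent" is
reduced to "every replica has a non-empty checkpoint". It uses POSIX
stat().

```c
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if path is an existing, non-empty regular file, else 0. */
int checkpoint_usable(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
    {
        return 0;
    }

    return S_ISREG(st.st_mode) && st.st_size > 0;
}

/* Counts usable checkpoints among <prefix><first>.cpt up to
 * <prefix><first+nrep-1>.cpt. The naming scheme is an assumption. */
int count_usable_checkpoints(const char *prefix, int first, int nrep)
{
    char name[4096];
    int  i, n = 0;

    for (i = 0; i < nrep; i++)
    {
        int len = snprintf(name, sizeof(name), "%s%d.cpt", prefix, first + i);

        /* Skip names that did not fit in the buffer. */
        if (len > 0 && (size_t)len < sizeof(name) && checkpoint_usable(name))
        {
            n++;
        }
    }

    return n;
}
```

A job script would then continue only if the count is either 0 (a fresh
start for all replicas) or equal to the number of replicas (all of them
can continue); any value in between corresponds to the situation shown
in the listing above.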