[gmx-developers] Reproducible runs with DLB

Thu Jul 21 16:30:08 CEST 2011

On 22/07/2011 12:15 AM, Bogdan Costescu wrote:
> Dear GROMACS developers,
>
> I need to be able to restart from an earlier point in a simulation and
> exactly reproduce the original simulation while running in parallel
> with DD. Although I save the state of the simulation in a checkpoint
> file (using mdrun -cpnum), upon restart with the same number of ranks,
> there are differences, small at the beginning but which become larger
> later, which seem to appear due to the different DD cell sizes as they
> are modified by the dynamic load balancing (DLB). Turning DLB off
> (mdrun -dlb no) or running in reproducible mode (mdrun -reprod) makes
> the restart exactly reproduce the original (at least based on the
> criteria I'm interested in), however the run is significantly slower -
> the molecular system is not homogeneous, so DLB helps a lot in
> redistributing the calculations.
>
> If my understanding of the issue is correct, saving the state of the
> DD together with the checkpoint data and loading it upon restart would
> allow me to keep DLB enabled and exactly reproduce the original run.
> Is this so?

Sounds right.
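
As a minimal illustration of why different cell sizes can change the result at all (assuming the differences come from floating-point summation order, which is not associative), here is a toy sketch. It is not GROMACS code: cell_sum and reduce_over_cells are hypothetical helpers, and the three numbers stand in for per-atom contributions that get assigned to cells differently depending on where the cell boundary sits.

```python
# Toy model: the same three contributions, partitioned into two "cells"
# in two different ways, as a moved DD cell boundary would do.

def cell_sum(values):
    """Accumulate the contributions inside one cell, left to right."""
    total = 0.0
    for v in values:
        total += v
    return total

def reduce_over_cells(cells):
    """Sum the per-cell partial sums, in cell order."""
    total = 0.0
    for cell in cells:
        total += cell_sum(cell)
    return total

contributions = [0.1, 0.2, 0.3]

# Boundary after the second contribution: (0.1 + 0.2) + 0.3
original = reduce_over_cells([contributions[:2], contributions[2:]])

# Boundary after the first contribution: 0.1 + (0.2 + 0.3)
restarted = reduce_over_cells([contributions[:1], contributions[1:]])

print(original == restarted)      # the two totals are not bit-identical
print(abs(original - restarted))  # the difference is at rounding level
```

A rounding-level difference like this is harmless in a single step, but in a chaotic system such as MD it grows over time, which matches "small at the beginning but which become larger later".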

>   What are the difficulties in doing it?

Extending the checkpoint file format is not programmer-friendly, never 
mind writing save-and-restore code for DD.

I suggest you look at the hidden options to mdrun that allow you to 
impose a particular DD grid that gives satisfactory performance. See 
"mdrun -h -hidden". You might have to reverse engineer how to use these 
from the code.
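
A rough sketch of what such a command line might look like, written as a small Python helper that assembles the arguments. The options -dd and -dlb are documented; -ddcsx, -ddcsy and -ddcsz are the hidden static cell size options, and the assumption here (to be checked against "mdrun -h -hidden" and the DD code) is that each takes a quoted string of relative sizes, one value per cell along that dimension. The helper name static_dd_args and the example grid and sizes are made up for illustration.

```python
import shlex

def static_dd_args(grid, rel_sizes):
    """Build mdrun arguments for a fixed DD grid with static cell sizes.

    grid      -- (nx, ny, nz) number of DD cells per dimension
    rel_sizes -- dict mapping "x", "y" or "z" to a list of relative cell
                 sizes; dimensions that are left out stay uniform
    """
    # DLB is switched off so that the cell sizes never change during the run.
    args = ["-dlb", "no", "-dd"] + [str(n) for n in grid]
    for axis, n_cells in zip("xyz", grid):
        sizes = rel_sizes.get(axis)
        if sizes is None:
            continue
        if len(sizes) != n_cells:
            raise ValueError(
                "need %d sizes along %s, got %d" % (n_cells, axis, len(sizes))
            )
        # Assumed format: one space-separated string per dimension.
        args += ["-ddcs" + axis, " ".join("%g" % s for s in sizes)]
    return args

# Hypothetical example: 4 x 2 x 1 grid, wider cells in the middle along x.
args = static_dd_args((4, 2, 1), {"x": [1, 1.5, 1.5, 1]})
print("mdrun " + " ".join(shlex.quote(a) for a in args))
```

Because the cell sizes are fixed on the command line, the original run and the restart use the same decomposition, at the cost of having to find good sizes by hand.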

Mark

> If this is
> doable, is someone with a good understanding of DD willing to guide me
> in implementing it? Of course, if someone with a good understanding
> of DD would be willing to implement it, I'd be more than glad to test
> it :-)
>
> Thanks in advance!
> Bogdan