[gmx-developers] Reproducible runs with DLB

Thu Jul 21 16:15:43 CEST 2011

Dear GROMACS developers,

I need to be able to restart from an earlier point in a simulation and
exactly reproduce the original simulation while running in parallel
with domain decomposition (DD). Although I save the state of the
simulation in checkpoint files (using mdrun -cpnum), a restart with the
same number of ranks shows differences from the original run. They are
small at the beginning and grow later, and they seem to be caused by
the DD cell sizes differing between the two runs, since these are
modified by the dynamic load balancing (DLB). Turning DLB off
(mdrun -dlb no) or running in reproducible mode (mdrun -reprod) makes
the restart exactly reproduce the original (at least based on the
criteria I'm interested in). However, the run is then significantly
slower: the molecular system is not homogeneous, so DLB helps a lot in
redistributing the calculations.
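
To illustrate why different cell sizes could matter at all, here is a
minimal sketch in Python (not GROMACS code). It assumes the mechanism
is the usual one: floating-point addition is not associative, so if
different cell sizes change the order in which contributions are
accumulated, the totals can differ in the last bits. The values and the
helper name sum_in_order are made up for the illustration.

```python
# Illustration only: the same three contributions summed in two orders.
contributions = [0.1, 0.2, 0.3]  # arbitrary example values, not real forces


def sum_in_order(values, order):
    """Accumulate values one by one, following the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


sum_a = sum_in_order(contributions, [0, 1, 2])  # (0.1 + 0.2) + 0.3
sum_b = sum_in_order(contributions, [2, 1, 0])  # (0.3 + 0.2) + 0.1

# The two totals are not bit-identical, although the inputs are the same.
print(sum_a == sum_b, abs(sum_a - sum_b))
```

In a chaotic system such as an MD simulation, a last-bit difference of
this kind would be expected to grow over time, which matches the
behaviour described above.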

If my understanding of the issue is correct, saving the state of the
DD together with the checkpoint data and loading it upon restart would
allow me to keep DLB enabled and exactly reproduce the original run.
Is this so? What are the difficulties in doing it? If this is
doable, is someone with a good understanding of DD willing to guide me
in implementing it? Of course, if someone with a good understanding
of DD would be willing to implement it, I'd be more than glad to test
it :-)
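
As a rough sketch of what I have in mind (Python, purely hypothetical;
the function names and the layout are my own invention and not the
GROMACS checkpoint format), the cell boundaries would have to be stored
as raw binary doubles so that they are restored bit for bit. Whether
the boundaries alone are enough to make DLB continue identically is
exactly what I do not know.

```python
import struct


def pack_cell_boundaries(boundaries):
    """Serialize floats as a 4-byte count followed by raw IEEE-754 doubles."""
    count = len(boundaries)
    return struct.pack("<I", count) + struct.pack(f"<{count}d", *boundaries)


def unpack_cell_boundaries(blob):
    """Inverse of pack_cell_boundaries: read the count, then the doubles."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return list(struct.unpack_from(f"<{count}d", blob, 4))


# Hypothetical relative cell boundaries along one DD dimension.
saved = [0.0, 0.2371, 0.5129, 0.7604, 1.0]
blob = pack_cell_boundaries(saved)
restored = unpack_cell_boundaries(blob)

# Binary storage round-trips exactly; a rounded text format might not.
print(restored == saved)
```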

Thanks in advance!
Bogdan