[gmx-users] Multiple writing to same trajectory issue

Thu Oct 16 17:38:16 CEST 2014

On Thu, Oct 16, 2014 at 5:04 PM, Matthias Ernst <
matthias.ernst at physik.uni-freiburg.de> wrote:

> Dear users,
>
> I usually resume calculation of a trajectory from a checkpoint file
> appending to the old files (which, as I learned, is obviously not a good
> idea..).
>

It's brittle in the case of things going wrong... but if you want full
backups, or proper transactional semantics on your files, then you're going
to need to participate in the process. Various mdrun command-line options
are available to help with this, but we can't have "save every checkpoint"
as the default behaviour. Defaulting to mdrun -noappend is not great either
... "concatenate all your files before every analysis," or "provide all
files to every analysis tool," are not fun kinds of workflows for users (or
implementers).

> Due to some issue with our queueing system, several jobs that were
> marked as depending on each other started at once and all wrote at the
> same trajectory file. I noticed this only 10 hours later.
>

Well, they opened the same file for writing, but what happens in such a
case surely depends on the file system implementation.

> What does Gromacs do if a time frame is written to a trajectory that
> already has this time? Is it doubled or discarded?

mdrun doesn't even know. It says "write a frame" and if the file system
implements "everybody with the file open for writing can write to it in
some order," or "only the first opener can write," or "only the last opener
can write," or "only a random process can write," or "subsequent openers
can't get a file handle, so they probably get an error condition," then the
result conforms with whatever the implementation does. All kinds of
software share the same limitation, of course.
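
As an illustration of the first of those possibilities, here is a minimal
Python sketch. It is not what mdrun does internally; the function names and
the file name traj.dat are made up for the example, and it assumes a local
POSIX-like file system where append-mode writes from several handles all
land in the file. On a network or parallel file system the outcome may well
differ.

```python
import os
import tempfile


def two_writers_demo(path, n_frames=3):
    """Open one file twice in append mode and write from both handles.

    Hypothetical helper for illustration only; it mimics two jobs that
    both believe they own the same output file.
    """
    # buffering=0 so that every write() is handed straight to the OS.
    first = open(path, "ab", buffering=0)
    second = open(path, "ab", buffering=0)
    try:
        for i in range(n_frames):
            first.write(b"job1 frame %d\n" % i)
            second.write(b"job2 frame %d\n" % i)
    finally:
        first.close()
        second.close()
    with open(path, "rb") as handle:
        return handle.read().decode().splitlines()


def run_demo():
    """Run the demo in a scratch directory and return the lines on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        return two_writers_demo(os.path.join(tmp, "traj.dat"))
```

Neither writer gets an error here, and both sets of "frames" end up in the
file, which is exactly why the writing program cannot tell that anything
went wrong.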

gmx check can perhaps help with the detective work about what did happen
and whether stuff is now broken.
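
Once you have the list of frame times from the trajectory (by whatever means
you prefer), the check itself is simple. The sketch below is a hypothetical
helper, not part of GROMACS; it assumes the times are in ps and that two
frames closer than tol count as the same time.

```python
def find_time_anomalies(times, tol=1e-3):
    """Return (duplicates, backward_jumps) as lists of frame indices.

    duplicates: frames whose time was already seen earlier in the file.
    backward_jumps: frames whose time is earlier than the previous frame.
    """
    seen = set()
    duplicates = []
    backward_jumps = []
    previous = None
    for index, time in enumerate(times):
        # Bin the time by the tolerance so near-identical floats match.
        key = round(time / tol)
        if key in seen:
            duplicates.append(index)
        seen.add(key)
        if previous is not None and time < previous - tol:
            backward_jumps.append(index)
        previous = time
    return duplicates, backward_jumps
```

For example, times of 0, 10, 20, 30, 20, 30, 40 ps would report frames 4 and
5 as duplicates and frame 4 as a backward jump, which is the pattern you
might see if a second job restarted from an older checkpoint.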

> And the same for the
> other files (log/edr...)?
>

Same, but whether the results on disk are mutually consistent depends on
the above...

> I think using trjconv with -e option just until the timestep of the
> multiple writing should recover at least the data gained until then, but
> is there a possibility to recover and resume my run at that point?
>

Only if there is still a matching checkpoint file from that time, which
would only be true if the resumed runs didn't proceed very far, or you kept
a backup somehow.
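
A rough Python sketch of that decision follows. The helper names, the file
names and the checkpoint times are all hypothetical, and you would have to
find out the time stored in each checkpoint yourself. The only GROMACS
options used are trjconv -f, -o and -e; the gmx prefix assumes the 5.0-style
wrapper binary, and the command list is only built here, not run.

```python
def latest_usable_checkpoint(checkpoint_times, last_good_time_ps):
    """Pick the latest checkpoint written at or before the last good time.

    checkpoint_times maps a checkpoint file name to its simulation time
    in ps. Returns None if no checkpoint is early enough.
    """
    usable = {
        name: time
        for name, time in checkpoint_times.items()
        if time <= last_good_time_ps
    }
    if not usable:
        return None
    return max(usable, key=usable.get)


def trjconv_truncate_args(trajectory, output, end_time_ps):
    """Build (but do not run) a trjconv call that stops reading at end_time_ps."""
    return [
        "gmx", "trjconv",
        "-f", trajectory,
        "-o", output,
        "-e", str(end_time_ps),
    ]
```

If the first function returns None, the data up to the last good frame can
still be salvaged with trjconv, but there is nothing to resume from.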

Mark

>
> Thanks for your help,
> Matthias
>
>