[gmx-users] Why does the -append option exist?

Dimitar Pachov dpachov at brandeis.edu
Tue Jun 7 23:21:34 CEST 2011


Hello,

Just a quick update after a few short tests my colleague and I ran. First, using

"*You can emulate this yourself by calling "sleep 10s" before mdrun and see
if that's long enough to solve the latency issue in your case.*"

doesn't work for a few reasons, mainly because it doesn't seem to be a
latency issue, but also because the load on a node is not affected by
"sleep".

However, you can reproduce the behavior I have observed pretty easily. It
seems to be related to the file offsets for the *xtc, *trr, *edr, etc. files
written at the end of the checkpoint file after abrupt crashes AND to how
frequently those files are accessed (opened) for writing. How to test (a
condensed shell sketch of the whole procedure is given after the steps):

1. In your input *mdp file set a high output frequency for the *xtc file
(nstxtcout = 10, for example) and a low one for the *trr file (nstxout =
10000, for example).
2. Run GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
3. Kill the run abruptly shortly after that (say, after 10-100 steps).
4. You should have a few frames written in the *xtc file, and only one (the
first) in the *trr file. The *cpt file should have non-zero
"file_offset_low" values for all of these files (the offsets have been
updated).

5. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
6. Kill the run abruptly shortly after that (say, after 10-100 steps). Make
sure the output interval for the *trr file has not been reached, i.e. it has
not been written to again.
7. You should have a few additional frames written in the *xtc file, while
the *trr still has only one frame (the first). The *cpt file again has
updated "file_offset_low" values for all the files, BUT the offset for the
*trr has acquired a value of 0. Obviously, we already know what will happen
if we restart again from this last *cpt file.

8. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
9. Kill it.
10. The *trr file now has size zero.
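
Condensed into a rough shell sketch (file names match the example above;
the sleep time and the -cpt interval are only indicative and need to be
adjusted so that a few *xtc frames and at least one checkpoint, but no new
*trr frame, are written before each kill):

# run.tpr built from an *mdp with, e.g.:
#   nstxtcout = 10      ; *xtc written every 10 steps (frequent)
#   nstxout   = 10000   ; *trr written every 10000 steps (infrequent)
for cycle in 1 2 3; do
    # steps 2/5/8: start, or restart from run.cpt (ignored if it does not
    # exist yet); -cpt 1 writes a checkpoint every minute instead of every 15
    mdrun -s run.tpr -v -cpi -deffnm run -cpt 1 &
    pid=$!
    sleep 90                       # steps 3/6/9: let it run a bit ...
    kill -9 $pid                   # ... then kill it abruptly
    wait
    ls -l run.xtc run.trr run.cpt  # after the 3rd cycle run.trr has size zero
done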


Therefore, if a run is killed before a given output file has been accessed
for writing again (which depends on the chosen output frequency), the file
offset recorded for it in the *cpt file does not seem to be carried over
correctly, and hence a new restart inevitably leads to overwritten output
files.
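
One way to watch this happen is to dump the checkpoint between kills and
look at the recorded offsets (gmxdump's -cp option prints the checkpoint
contents; the exact field names may vary a bit between versions):

gmxdump -cp run.cpt                          # full dump, including the output-file list
gmxdump -cp run.cpt | grep -i file_offset    # just the offsets; the *trr entry drops to 0 at step 7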

Do you think this is fixable?

Thanks,
Dimitar






On Sun, Jun 5, 2011 at 6:20 PM, Roland Schulz <roland at utk.edu> wrote:

> Two comments about the discussion:
>
> 1) I agree that buffered output (kernel buffers, not application buffers)
> should not affect I/O. If it does, it should be filed as a bug against the
> OS. Maybe someone can write a short test application which tries to
> reproduce this idea: write to a file from one node, and immediately after
> the test program is killed on that node, write to the same file from some
> other node.
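>
> A minimal sketch of such a test might look like this (shell; /shared is a
> placeholder for a filesystem mounted on both nodes, and the two parts are
> run by hand on nodeA and nodeB respectively):
>
> # On nodeA: keep appending to a file on the shared filesystem, then
> # kill the writer abruptly
> i=0; while true; do echo $i >> /shared/iotest.txt; i=$((i+1)); done &
> sleep 5; kill -9 $!
>
> # Immediately afterwards, on nodeB: check what is visible there and append
> tail -n 1 /shared/iotest.txt
> echo from-nodeB >> /shared/iotest.txt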
>
> 2) We do lock files, but only the log file. The idea is that we only need
> to guarantee that the set of output files is accessed by one application at
> a time. This seems safe, but if someone sees a way in which the trajectory
> could be opened without the log file being opened, please file a bug.
>
> Roland
>
>> On Sun, Jun 5, 2011 at 10:13 AM, Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
>
>>  On 5/06/2011 11:08 PM, Francesco Oteri wrote:
>>
>> Dear Dimitar,
>> I'm following the debate regarding:
>>
>>
>>    The point was not "why" I was getting the restarts, but the fact
>> itself that I was getting restarts close together in time, as I stated in
>> my first post. I actually also don't know whether jobs are deleted or
>> suspended. I had thought that a job returned to the queue would basically
>> start from the beginning when later moved to an empty slot ... so I don't
>> understand the difference from that perspective.
>>
>>
>> In the second mail you say:
>>
>>  Submitted by:
>> ========================
>> ii=1
>> ifmpi="mpirun -np $NSLOTS"
>> --------
>> if [ ! -f run${ii}-i.tpr ]; then
>>     cp run${ii}.tpr run${ii}-i.tpr
>>     tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
>> fi
>>
>> k=`ls md-${ii}*.out | wc -l`
>> outfile="md-${ii}-$k.out"
>>
>> if [[ -f run${ii}.cpt ]]; then
>>     $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -v \
>>         -deffnm run${ii} -npme 0 > $outfile 2>&1
>> fi
>> =========================
>>
>>
>> If I understand correctly, you are submitting the SERIAL mdrun. This means
>> that multiple instances of mdrun are running at the same time.
>> Each instance of mdrun is an INDEPENDENT instance. Therefore checkpoint
>> files, one for each instance (i.e. one for each CPU), are written at the
>> same time.
>>
>>
>> Good thought, but Dimitar's stdout excerpts from early in the thread do
>> indicate the presence of multiple execution threads. Dynamic load balancing
>> gets turned on, and the DD is 4x2x1 for his 8 processors. Conventionally,
>> and by default in the installation process, the MPI-enabled binaries get an
>> "_mpi" suffix, but it isn't enforced - or enforceable :-)
>>
>> Mark
>>
>
>
>
> --
> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> 865-241-1537, ORNL PO BOX 2008 MS6309
>



-- 
=====================================================
Dimitar V Pachov

PhD Physics
Postdoctoral Fellow
HHMI & Biochemistry Department        Phone: (781) 736-2326
Brandeis University, MS 057                Email: dpachov at brandeis.edu
=====================================================

