[gmx-users] Why does the -append option exist?

Sun Jun 5 04:31:03 CEST 2011

On Sat, Jun 4, 2011 at 9:09 PM, Mark Abraham <Mark.Abraham at anu.edu.au> wrote:

>  On 5/06/2011 3:11 AM, Dimitar Pachov wrote:
>
>  On Fri, Jun 3, 2011 at 9:24 PM, Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
>
>  On 4/06/2011 8:26 AM, Dimitar Pachov wrote:
>>
>>
>  Here is an example:
>
>  ========================
>  [dpachov]$ ll -rth run1*  \#run1*
> -rw-rw-r-- 1 dpachov dpachov  11K May  2 02:59 run1.po.mdp
> -rw-rw-r-- 1 dpachov dpachov 4.6K May  2 02:59 run1.grompp.out
> -rw-rw-r-- 1 dpachov dpachov 3.5M May 13 19:09 run1.gro
> -rw-rw-r-- 1 dpachov dpachov 2.3M May 14 00:40 run1.tpr
> -rw-rw-r-- 1 dpachov dpachov 2.3M May 14 00:40 run1-i.tpr
> -rw-rw-r-- 1 dpachov dpachov    0 May 29 21:53 run1.trr
> -rw-rw-r-- 1 dpachov dpachov 1.2M May 31 10:45 run1.cpt
> -rw-rw-r-- 1 dpachov dpachov 1.2M May 31 10:45 run1_prev.cpt
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 14:03 run1.xtc
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 14:03 run1.edr
> -rw-rw-r-- 1 dpachov dpachov  15M Jun  3 17:03 run1.log
>  ========================
>
>  Submitted by:
> ========================
> ii=1
> ifmpi="mpirun -np $NSLOTS"
> --------
>    if [ ! -f run${ii}-i.tpr ];then
>        cp run${ii}.tpr run${ii}-i.tpr
>       tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
>    fi
>
>     k=`ls md-${ii}*.out | wc -l`
>    outfile="md-${ii}-$k.out"
>    if [[ -f run${ii}.cpt ]]; then
>
>        $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -v -deffnm run${ii} -npme 0 > $outfile  2>&1
>
>     fi
>  =========================
>
>
> This script is not using mdrun -append.
>

-append is the default; it doesn't need to be explicitly listed.

> Your original post suggested the use of -append was a problem. Why aren't
> we seeing a script with mdrun -append? Also, please provide the full script
> - it looks like there might be a loop around your tpbconv-then-mdrun
> fragment.
>

There is no loop; this is a job script with PBS directives. The header of it
looks like:
===========================
#!/bin/bash
#$ -S /bin/bash
#$ -pe mpich 8
#$ -ckpt reloc
#$ -l mem_total=6G
===========================

as usual submitted by:

qsub -N aaaa myjob.q

>
> Note that a useful trouble-shooting technique can be to construct your
> command line in a shell variable, echo it to stdout (redirected as suitable)
> and then execute the contents of the variable. Now, nobody has to parse a
> shell script to know what command line generated what output, and it can be
> co-located with the command's stdout.
>

I somewhat understand your point, but could you give an example if you think
it is really necessary?
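
A minimal sketch of the technique described above, assuming bash. The
helper name run_logged is hypothetical (it is not part of GROMACS); the
mdrun flags in the example call are the ones from the job script quoted
earlier. The command line is written to the output file first, then the
command's own stdout and stderr follow it in the same file.

```shell
#!/bin/bash
# run_logged is a hypothetical helper, not part of GROMACS.
# Usage: run_logged OUTFILE COMMAND [ARGS...]
run_logged() {
    local outfile=$1
    shift
    # Record the exact command line first, so the output file documents itself.
    echo "COMMAND: $*" > "$outfile"
    # Then run the command, appending its stdout and stderr below that line.
    # The function returns the exit status of the command.
    "$@" >> "$outfile" 2>&1
}

# Example call, using the flags from the job script quoted above:
# run_logged "md-${ii}-${k}.out" mpirun -np "$NSLOTS" mdrun -s "run${ii}.tpr" \
#     -cpi "run${ii}.cpt" -v -deffnm "run${ii}" -npme 0
```

With this, each md-1-NNNN.out file starts with the command that produced
it, so nobody has to read the job script to match output to command line.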

  <snip>
>
>  Writing checkpoint, step 51879590 at Tue May 31 10:45:22 2011
>     Energies (kJ/mol)
>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>     8.33208e+03    4.72300e+03    5.31983e+02   -1.21532e+03    2.89586e+03
>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>      3.00900e+04    9.31785e+04   -3.87790e+03   -7.40841e+05   -8.36838e+04
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -6.89867e+05    1.28721e+05   -5.61146e+05    2.99472e+02   -1.24229e+02
>  Pressure (bar)   Constr. rmsd
>    -1.03491e+02    2.99840e-05
>  ====================================
>
>
> So the -append restart looks like it did fine here.
>
>
>  Last output files from restarts:
> ====================================
>  [dpachov]$ ll -rth md-1-*out | tail -10
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 16:40 md-1-2428.out
>  -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:44 md-1-2429.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 16:46 md-1-2430.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 16:48 md-1-2431.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 16:50 md-1-2432.out
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 16:52 md-1-2433.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:55 md-1-2434.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:58 md-1-2435.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 17:03 md-1-2436.out
> *-rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 17:04 md-1-2437.out*
>  ====================================
> + around the time when the run1.xtc file seems to have been saved:
> ====================================
>  [dpachov]$ ll -rth md-1-23[5-6][0-9]*out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:37 md-1-2350.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:39 md-1-2351.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:43 md-1-2352.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:45 md-1-2353.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 13:46 md-1-2354.out
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 13:47 md-1-2355.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:49 md-1-2356.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:52 md-1-2357.out
> -rw-rw-r-- 1 dpachov dpachov  12K Jun  3 13:57 md-1-2358.out
> *-rw-rw-r-- 1 dpachov dpachov  12K Jun  3 14:02 md-1-2359.out*
> *-rw-rw-r-- 1 dpachov dpachov 6.0K Jun  3 14:03 md-1-2360.out*
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 14:06 md-1-2361.out
> -rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 14:09 md-1-2362.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 14:10 md-1-2363.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:11 md-1-2364.out
> -rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 14:12 md-1-2365.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:13 md-1-2366.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:14 md-1-2367.out
> -rw-rw-r-- 1 dpachov dpachov 6.0K Jun  3 14:17 md-1-2368.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 14:18 md-1-2369.out
>  ====================================
>
>
> I don't understand why you have so many restarts only a minute or two
> apart. Checkpoints are only written (by default) every 15 minutes, and no
> job seems to run that long, so all of these will start from the same point.
> If they're running simultaneously then it's conceivable that multiple
> processes trying to use the same output file could be a problem, as
> suggested by Jussi. You say that's not the case. So why are there so many
> restarts?
>

As I said, the queue works like this: you submit the job, it finds an empty
node and goes there; however, seconds later another user with
higher privileges on that particular node submits a job, and his job kicks out
my job. Mine goes on the queue again, finds another empty node, and goes
there; then another user with high privileges on that node submits a job,
which consequently kicks out my job again, and the cycle repeats itself ...
Theoretically, it could continue forever, depending on how many empty nodes
there are and where they are, if any. These many restarts suggest that the
queue was full of relatively short jobs run by users with high privileges.
Technically, I cannot see why the same processes should be running
simultaneously, because at any instant my job either runs on only one node or
stays in the queuing list.

>  From md-1-2360.out:
> =====================================
>  :::::::
>   Getting Loaded...
> Reading file run1.tpr, VERSION 4.5.4 (single precision)
>
>  Reading checkpoint file run1.cpt generated: Tue May 31 10:45:22 2011
>
>
>  Loaded with Money
>
>  Making 2D domain decomposition 4 x 2 x 1
>
>  WARNING: This run will generate roughly 4915 Mb of data
>
>  starting mdrun 'run1'
> 100000000 steps, 200000.0 ps (continuing from step 51879590, 103759.2 ps).
>  =====================================
>
>
> These aren't showing anything other than that the restart is coming from
> the same point each time.
>
>
>  And from the last generated output md-1-2437.out (I think I killed the
> job at that point because of the above observed behavior):
> =====================================
>  :::::::
>  Getting Loaded...
> Reading file run1.tpr, VERSION 4.5.4 (single precision)
>  =====================================
>
>  I have at least 5-6 additional examples like this one. In some of them
> the *xtc file does have size greater than zero yet still very small, but it
> starts from some random frame (for example, in one of the cases it contains
> frames from ~91000ps to ~104000ps, but all frames before 91000ps are
> missing).
>
>
> I think that demonstrating a problem requires that the set of output files
> were fine before one particular restart, and weird afterwards. I don't think
> we've seen that yet.
>
>
I don't understand your point here. I am providing you with all the info I have.
I am showing the output files of 3 restarts, and they differ in the
sense that the last two did not progress far enough before another job
restart occurred. The first was fine before the restart, and the others were
not exactly fine after the restart. At this point I realize that what I call
a "restart" and what you call a "restart" might be two different things. And
this is where the problem might lie.

>
>  I realize there might be another problem, but the bottom line is that
> there is no mechanism that can prevent this from happening if many restarts
> are required, and particularly if the timing between these restarts is prone
> to be small (distributed computing could easily satisfy this condition).
>
>  Any suggestions, particularly related to the minimum resistance path to
> regenerate the missing data? :)
>
>
>
>>
>>
>>
>> Using the checkpoint capability & appending make sense when many restarts
>> are expected, but unfortunately it is exactly then when these options
>> completely fail! As a new user of Gromacs, I must say I am disappointed, and
>> would like to obtain an explanation of why the usage of these options is
>> clearly stated to be safe when it is not, and why the append option is the
>> default, and why at least a single warning has not been posted anywhere in
>> the docs & manuals?
>>
>>
>>  I can understand and sympathize with your frustration if you've
>> experienced the loss of a simulation. Do be careful when suggesting that
>> others' actions are blame-worthy, however.
>>
>
>  I have never suggested this. As a user, I am entitled to ask.
>
>
> Sure. However, talking about something that can "completely fail"
>

This is a fact, backed up by my evidence => I don't see anything bad
directed at anybody.

> which makes you "disappointed"
>

This is me being honest => again not related to anybody else.

> and wanting to "obtain an explanation"
>

Well, this even is funny :) - many people want this, especially in science.
Is that bad?

> about why something doesn't work as stated and lacks "a single warning"
>

Again a fact => again nothing bad here.

> suggests that someone has done something less than appropriate
>

This is a completely personal interpretation, and I am personally not
responsible for how people perceive information. For a reason unknown to me,
you moved into a very defensive mode. What could I do?

> , and so blame-worthy. It also assumes that the actions of a new user were
> correct, and the actions of a developer with long experience were not.
>

Sorry, this is too much. Where was this suggested? It seems to me you took
it too personally.

> This may or may not prove to be true. Starting such a discussion from a
> conciliatory (rather than antagonistic) stance is usually more productive.
> The shared objective should be to fix the problem, not prove that someone
> did something wrong.
>

Agreed, and I did so. Again, your perception does not seem to be correlated
with my intended approach.

>
> An alternative way of wording your paragraph could have been:
> "Using the checkpoint capability & appending make sense when many restarts
> are expected, however I observe that under such circumstances this
> capability can fail. I am a new user of GROMACS, might I have been using
> them incorrectly? Are the developers aware of any situations under which the
> capability is unreliable? If so, should the default behaviour be different,
> and should this issue be documented somewhere?"
>

This is helpful, but again a bit too much. I don't tell you how to write,
please do the same.
Moreover, how could I ask questions the answers to which were mostly known
to me before sending my post?

>
>
>  And since my questions were not clearly answered, I will repeat them in a
> structured way:
>
>  1. Why is the usage of these options (-cpi and -append) clearly stated to
> be safe when in fact it is not?
>
>
> Because they are believed to be safe. Jussi's suggestion about file locking
> may have merit.
>
>
>  2. Why have you made the -append option the default in the most current
> GMX versions?
>
>
> Because it's the most convenient mode of operation.
>

>
>  3. Why has not a single warning been posted anywhere in the docs &
> manuals? (this question is somewhat clear - because you did not know about
> such a problem, but people say "ignorance of the law excuses no one",
> which means ignoring to put a warning for something that you were not 100%
> certain it would be error-free could not be an excuse)
>
>
> Because no-one is aware of a problem to warn about.
>

No, people are aware, they just do not think it is a problem, because there
is an easy work-around (-noappend), although not as convenient and clean.
Ask users of the Condor distributed grid using Gromacs.
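
A minimal sketch of that work-around, assuming bash and the file naming of
the job script quoted earlier. The helper name restart_args is
hypothetical; -cpi, -deffnm, -npme and -noappend are existing mdrun
options in GROMACS 4.5. Testing that the checkpoint is non-empty (rather
than merely present) is an extra precaution added here, not something the
original script did.

```shell
#!/bin/bash
# restart_args is a hypothetical helper, not part of GROMACS.
# Prints the mdrun arguments for run number $1, looking for the
# checkpoint file in the current directory.
restart_args() {
    local ii=$1
    local args="-s run${ii}.tpr -deffnm run${ii} -npme 0 -v"
    if [ -s "run${ii}.cpt" ]; then
        # A non-empty checkpoint exists: continue from it, but write new
        # part-numbered output files instead of appending to the old ones.
        args="$args -cpi run${ii}.cpt -noappend"
    fi
    echo "$args"
}

# Example call (left unquoted on purpose so the string splits into arguments):
# mpirun -np "$NSLOTS" mdrun $(restart_args 1) > "$outfile" 2>&1
```

Because no restart writes into a file produced by an earlier one, an
interrupted restart cannot truncate existing data. The price is one set
of part files per restart, which can be joined afterwards with trjcat
(trajectories) and eneconv (energy files).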

>
>
>  I am blame-worthy - for blindly believing what was written in the manual
> without taking the necessary precautions. Lesson learned.
>
>
>> However, developers' time rarely permits addressing "feature X doesn't
>> work, why not?" in a productive way. Solving bugs can be hard, but will be
>> easier (and solved faster!) if the user who thinks a problem exists follows
>> good procedure. See http://www.chiark.greenend.org.uk/~sgtatham/bugs.html
>>
>>
>  Implying that I did not follow a certain procedure related to a certain
> problem without you knowing what my initial intention was is just a
> speculation.
>
>
> I don't follow your point. If your intent is to get the problem being
> fixed, the advice on that web page is useful.
>

My intent was clearly stated before, but for the sake of clarification,
let me repeat it:

1. To let you know about the existence of such a problem.
2. To find out why I encountered the problem, although I have read and
followed all of the Gromacs documentation related to the features I used.
3. To somewhat improve the way the documentation is written.

Pay attention to the fact that I did NOT have an incentive to get help in
solving my problem. There always exist many work-arounds.

> If your intent is to prove someone else did something wrong then it's time
> to stop the discussion :-)
>

I did not want to prove anything besides that a problem existed. Logically,
problems are derived from somewhere, and that somewhere is not nowhere.

Of course, as I also mentioned, this is my problem => I did something wrong,
which I already stated. You imply I impose blame, which is again too
defensive of a statement, and hence I am just going to leave further
conclusions to you. :)

Thanks,
Dimitar

>
> Cheers,
>
> Mark
>
>