[gmx-users] Why does the -append option exist?

Dimitar Pachov dpachov at brandeis.edu
Sat Jun 4 22:57:04 CEST 2011


On Sat, Jun 4, 2011 at 1:50 PM, Rossen Apostolov <rossen at kth.se> wrote:

> Hi,
>
> On Jun 4, 2011, at 19:11, Dimitar Pachov <dpachov at brandeis.edu> wrote:
>
> By the way, is this ever reviewed:
>
> "Your mail to 'gmx-users' with the subject
>
>    Re: [gmx-users] Why does the -append option exist?
>
> Is being held until the list moderator can review it for approval."
>
>
> This message usually comes when, e.g., one sends mails larger than 50K, which
> are eventually discarded. If you need to send big attachments, post a
> download link instead.
>


If they are eventually discarded, why doesn't the message say so? It's
confusing. My message with all the quotes was 52 KB, with no attachments.
Anyway, I resent it, but it appeared as a quote.

"*So you are referring to the case where you have multiple, independent*
*processes all using the same trajectory file. Yes, this will probably
*
*lead to problems, unless the trajectory file is somehow locked.*"

I don't think so - during a restart the old processes should be killed and new
ones started. If that is not the case, then you might be right.
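
If it turns out old mdrun processes can survive a restart, a crude guard at
the job-script level - just a sketch using flock(1) and my file names, not
anything mdrun itself does - could at least refuse to start a second mdrun
on the same files:

========================
# Sketch only: hold an exclusive lock on run1.lock so a second restart
# cannot append to run1.* while an mdrun from an earlier submission is
# still alive on the same filesystem.
(
  flock -n 9 || { echo "run1 appears to be running already; not restarting" >&2; exit 1; }
  mpirun -np $NSLOTS `which mdrun` -s run1.tpr -cpi run1.cpt -deffnm run1 -append
) 9> run1.lock
========================

Whether this helps depends on the filesystem; flock is advisory and not
reliable over NFS, so treat it only as an illustration of the locking idea.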


Thanks,
Dimitar



>
> Cheers,
> Rossen
>
>
> On Fri, Jun 3, 2011 at 9:24 PM, Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
>
>> On 4/06/2011 8:26 AM, Dimitar Pachov wrote:
>>
>>
>> If this is true, then it wants fixing, and fast, and will get it :-)
>> However, it would be surprising for such a problem to exist and not have
>> been reported up to now. This feature has been in the code for a year now,
>> and while some minor issues have been fixed since the 4.5 release, it would
>> surprise me greatly if your claim was true.
>>
>> You're saying the equivalent of the steps below can occur:
>> 1. Simulation wanders along normally and writes a checkpoint at step 1003
>> 2. Random crash happens at step 1106
>> 3. An -append restart from the old .tpr and the recent .cpt file will
>> restart from step 1003
>> 4. Random crash happens at step 1059
>> 5. Now a restart doesn't restart from step 1003, but some other step
>>
>>
>> and, most importantly, the trajectory file - the most valuable piece of
>> data - could be completely lost! I don't know the code behind the
>> checkpointing & appending, but I can see how easily one could overwrite a
>> 100 ns trajectory, for example, and "obtain" the same trajectory with a size
>> of .... 0.
>>
>>
>> I don't see how that can easily happen without a concrete example in which
>> user error can be ruled out.
>>
>
> Here is an example:
>
> ========================
> [dpachov]$ ll -rth run1*  \#run1*
> -rw-rw-r-- 1 dpachov dpachov  11K May  2 02:59 run1.po.mdp
> -rw-rw-r-- 1 dpachov dpachov 4.6K May  2 02:59 run1.grompp.out
> -rw-rw-r-- 1 dpachov dpachov 3.5M May 13 19:09 run1.gro
> -rw-rw-r-- 1 dpachov dpachov 2.3M May 14 00:40 run1.tpr
> -rw-rw-r-- 1 dpachov dpachov 2.3M May 14 00:40 run1-i.tpr
> -rw-rw-r-- 1 dpachov dpachov    0 May 29 21:53 run1.trr
> -rw-rw-r-- 1 dpachov dpachov 1.2M May 31 10:45 run1.cpt
> -rw-rw-r-- 1 dpachov dpachov 1.2M May 31 10:45 run1_prev.cpt
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 14:03 run1.xtc
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 14:03 run1.edr
> -rw-rw-r-- 1 dpachov dpachov  15M Jun  3 17:03 run1.log
> ========================
>
> Submitted by:
> ========================
> ii=1
> ifmpi="mpirun -np $NSLOTS"
> --------
>    if [ ! -f run${ii}-i.tpr ];then
>       cp run${ii}.tpr run${ii}-i.tpr
>       tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
>    fi
>
>    k=`ls md-${ii}*.out | wc -l`
>    outfile="md-${ii}-$k.out"
>    if [[ -f run${ii}.cpt ]]; then
>
>        $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -v -deffnm run${ii} -npme 0 > $outfile 2>&1
>
>    fi
> =========================
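>
> A defensive variant I could add to this script - an untested sketch, using
> the same naming scheme - would snapshot the appendable outputs before each
> restart, so that a truncated .xtc/.edr/.log could at least be recovered
> from the last copy (I had not done this here):
>
> =========================
>    # Untested sketch: back up the appendable outputs before each restart
>    mkdir -p backup
>    stamp=`date +%Y%m%d-%H%M%S`
>    for ext in xtc trr edr log cpt; do
>        if [ -f run${ii}.${ext} ]; then
>            cp -p run${ii}.${ext} backup/run${ii}.${ext}.${stamp}
>        fi
>    done
> =========================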
>
> From the end of run1.log:
> =========================
> Started mdrun on node 0 Tue May 31 10:28:52 2011
>
>            Step           Time         Lambda
>        51879390   103758.78000        0.00000
>
>    Energies (kJ/mol)
>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>     8.37521e+03    4.52303e+03    4.78633e+02   -1.23174e+03    2.87366e+03
>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     3.02277e+04    9.48267e+04   -3.88596e+03   -7.43902e+05   -8.36436e+04
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -6.91359e+05    1.29016e+05   -5.62342e+05    3.00159e+02   -1.24746e+02
>  Pressure (bar)   Constr. rmsd
>    -2.43143e+00    0.00000e+00
>
> DD  step 51879399 load imb.: force 225.5%
>
> Writing checkpoint, step 51879590 at Tue May 31 10:45:22 2011
>
>
>
>
> -----------------------------------------------------------
> Restarting from checkpoint, appending to previous log file.
>
> Log file opened on Fri Jun  3 17:03:20 2011
> Host: compute-1-13.local  pid: 337  nodeid: 0  nnodes:  8
> The Gromacs distribution was built Tue Mar 22 09:26:37 EDT 2011 by
> dpachov at login-0-0.local (Linux 2.6.18-194.17.1.el5xen x86_64)
>
> :::
> :::
> :::
>
> Grid: 13 x 15 x 11 cells
>  Initial temperature: 301.137 K
>
> Started mdrun on node 0 Fri Jun  3 13:58:07 2011
>
>            Step           Time         Lambda
>        51879590   103759.18000        0.00000
>
>    Energies (kJ/mol)
>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>     8.47435e+03    4.61654e+03    3.99388e+02   -1.16765e+03    2.93920e+03
>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     2.99294e+04    9.42035e+04   -3.87927e+03   -7.43250e+05   -8.35872e+04
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -6.91322e+05    1.29433e+05   -5.61889e+05    3.01128e+02   -1.24317e+02
>  Pressure (bar)   Constr. rmsd
>    -2.18259e+00    0.00000e+00
>
> DD  step 51879599 load imb.: force 43.7%
>
> At step 51879600 the performance loss due to force load imbalance is 17.5 %
>
> NOTE: Turning on dynamic load balancing
>
> DD  step 51879999  vol min/aver 0.643  load imb.: force  0.4%
>
> ::
> ::
> ::
>
> DD  step 51884999  vol min/aver 0.647  load imb.: force  0.3%
>
>            Step           Time         Lambda
>        51885000   103770.00000        0.00000
>
>    Energies (kJ/mol)
>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>     8.33208e+03    4.72300e+03    5.31983e+02   -1.21532e+03    2.89586e+03
>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     3.00900e+04    9.31785e+04   -3.87790e+03   -7.40841e+05   -8.36838e+04
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -6.89867e+05    1.28721e+05   -5.61146e+05    2.99472e+02   -1.24229e+02
>  Pressure (bar)   Constr. rmsd
>    -1.03491e+02    2.99840e-05
> ====================================
>
> Last output files from restarts:
> ====================================
> [dpachov]$ ll -rth md-1-*out | tail -10
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 16:40 md-1-2428.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:44 md-1-2429.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 16:46 md-1-2430.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 16:48 md-1-2431.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 16:50 md-1-2432.out
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 16:52 md-1-2433.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:55 md-1-2434.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:58 md-1-2435.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 17:03 md-1-2436.out
> *-rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 17:04 md-1-2437.out*
> ====================================
> and the output files from around the time the run1.xtc file seems to have
> been last written:
> ====================================
> [dpachov]$ ll -rth md-1-23[5-6][0-9]*out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:37 md-1-2350.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:39 md-1-2351.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:43 md-1-2352.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:45 md-1-2353.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 13:46 md-1-2354.out
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 13:47 md-1-2355.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:49 md-1-2356.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:52 md-1-2357.out
> -rw-rw-r-- 1 dpachov dpachov  12K Jun  3 13:57 md-1-2358.out
> *-rw-rw-r-- 1 dpachov dpachov  12K Jun  3 14:02 md-1-2359.out*
> *-rw-rw-r-- 1 dpachov dpachov 6.0K Jun  3 14:03 md-1-2360.out*
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 14:06 md-1-2361.out
> -rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 14:09 md-1-2362.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 14:10 md-1-2363.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:11 md-1-2364.out
> -rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 14:12 md-1-2365.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:13 md-1-2366.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:14 md-1-2367.out
> -rw-rw-r-- 1 dpachov dpachov 6.0K Jun  3 14:17 md-1-2368.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 14:18 md-1-2369.out
> ====================================
>
> From md-1-2359.out:
> =====================================
> :::::::
> Getting Loaded...
> Reading file run1.tpr, VERSION 4.5.4 (single precision)
>
> Reading checkpoint file run1.cpt generated: Tue May 31 10:45:22 2011
>
>
> Loaded with Money
>
> Making 2D domain decomposition 4 x 2 x 1
>
> WARNING: This run will generate roughly 4915 Mb of data
>
> starting mdrun 'run1'
> 100000000 steps, 200000.0 ps (continuing from step 51879590, 103759.2 ps).
> step 51879590, will finish Wed Aug 17 14:21:59 2011
> imb F 44%
> NOTE: Turning on dynamic load balancing
>
> step 51879600, will finish Fri Jul 15 14:00:00 2011
> vol 0.64  imb F  0% step 51879700, will finish Mon Jun 27 02:19:09 2011
> vol 0.63  imb F  0% step 51879800, will finish Sat Jun 25 15:14:01 2011
> vol 0.64  imb F  1% step 51879900, will finish Sat Jun 25 02:11:53 2011
> vol 0.64  imb F  0% step 51880000, will finish Fri Jun 24 19:48:54 2011
> vol 0.64  imb F  1% step 51880100, will finish Fri Jun 24 15:55:19 2011
> ::::::
> vol 0.67  imb F  0% step 51886400, will finish Fri Jun 24 02:51:45 2011
> vol 0.66  imb F  0% step 51886500, will finish Fri Jun 24 02:48:10 2011
> vol 0.66  imb F  0% step 51886600, will finish Fri Jun 24 02:47:33 2011
> =====================================
>
> From md-1-2360.out:
> =====================================
> :::::::
> Getting Loaded...
> Reading file run1.tpr, VERSION 4.5.4 (single precision)
>
> Reading checkpoint file run1.cpt generated: Tue May 31 10:45:22 2011
>
>
> Loaded with Money
>
> Making 2D domain decomposition 4 x 2 x 1
>
> WARNING: This run will generate roughly 4915 Mb of data
>
> starting mdrun 'run1'
> 100000000 steps, 200000.0 ps (continuing from step 51879590, 103759.2 ps).
> =====================================
>
> And from the last generated output md-1-2437.out (I think I killed the job
> at that point because of the above observed behavior):
> =====================================
> :::::::
> Getting Loaded...
> Reading file run1.tpr, VERSION 4.5.4 (single precision)
> =====================================
>
> I have at least 5-6 additional examples like this one. In some of them the
> .xtc file does have a size greater than zero, yet it is still very small,
> and it starts from some random frame (for example, in one case it contains
> frames from ~91000 ps to ~104000 ps, but all frames before 91000 ps are
> missing).
>
> I realize there might be another problem, but the bottom line is that there
> is no mechanism to prevent this from happening when many restarts are
> required, particularly when the time between restarts tends to be short
> (distributed computing could easily satisfy this condition).
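>
> The only user-level guard I can think of - again just an untested sketch,
> not a real fix - would be to record the size of run${ii}.xtc at every
> restart and refuse to run mdrun with -append if the file has shrunk since
> the previous restart:
>
> ====================================
>    # Untested sketch: refuse to restart with -append if run${ii}.xtc shrank
>    size_now=`stat -c%s run${ii}.xtc 2>/dev/null || echo 0`
>    size_old=`cat run${ii}.xtc.size 2>/dev/null || echo 0`
>    if [ "$size_now" -lt "$size_old" ]; then
>        echo "run${ii}.xtc shrank from $size_old to $size_now bytes; not restarting" >&2
>        exit 1
>    fi
>    echo $size_now > run${ii}.xtc.size
> ====================================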
>
> Any suggestions, particularly on the path of least resistance to
> regenerate the missing data? :)
>
>
>
>>
>>
>>
>> Using the checkpoint capability & appending makes sense when many restarts
>> are expected, but unfortunately it is exactly then that these options
>> completely fail! As a new user of Gromacs, I must say I am disappointed, and
>> I would like an explanation of why the usage of these options is clearly
>> stated to be safe when it is not, why the append option is the default, and
>> why not even a single warning has been posted anywhere in the docs & manuals.
>>
>>
>> I can understand and sympathize with your frustration if you've
>> experienced the loss of a simulation. Do be careful when suggesting that
>> others' actions are blame-worthy, however.
>>
>
> I have never suggested this. As a user, I am entitled to ask. And since my
> questions were not clearly answered, I will repeat them in a structured way:
>
> 1. Why is the usage of these options (-cpi and -append) clearly stated to
> be safe when in fact it is not?
> 2. Why have you made the -append option the default in the most recent GMX
> versions?
> 3. Why has not a single warning been posted anywhere in the docs & manuals?
> (This question is somewhat clear - because you did not know about such a
> problem. But people say "ignorance of the law excuses no one", so failing to
> post a warning about something you were not 100% certain was error-free is
> not much of an excuse.)
>
> I am blame-worthy - for blindly believing what was written in the manual
> without taking the necessary precautions. Lesson learned.
>
>
>> However, developers' time rarely permits addressing "feature X doesn't
>> work, why not?" in a productive way. Solving bugs can be hard, but will be
>> easier (and solved faster!) if the user who thinks a problem exists follows
>> good procedure. See http://www.chiark.greenend.org.uk/~sgtatham/bugs.html
>>
>>
> Implying that I did not follow a certain procedure for a certain problem,
> without knowing what my initial intention was, is just speculation.
>
> In any case, I do appreciate the time everybody unselfishly devotes to
> communicating with people experiencing problems.
>
> Thanks,
> Dimitar
>



-- 
=====================================================
*Dimitar V Pachov*

PhD Physics
Postdoctoral Fellow
HHMI & Biochemistry Department        Phone: (781) 736-2326
Brandeis University, MS 057                Email: dpachov at brandeis.edu
=====================================================