[gmx-users] Why does the -append option exist?

Sun Jun 5 10:37:13 CEST 2011

On 5/06/2011 5:42 PM, Dimitar Pachov wrote:
>
>
> On Sun, Jun 5, 2011 at 2:14 AM, Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
>
>     On 5/06/2011 12:31 PM, Dimitar Pachov wrote:
>>     As I said, the queue is like this: you submit the job, it finds
>>     an empty node, it goes there, however seconds later another user
>>     with higher privileges on that particular node submits a job, his
>>     job kicks out my job, mine goes on the queue again, it finds
>>     another empty node, goes there, then another user with
>>     high privileges on that node submits a job, which consequently
>>     kicks out my job again, and the cycle repeats itself ...
>>     theoretically, it could continue forever, depending on how many
>>     and where the empty nodes are, if any.
>
>     You've said that *now* - but previously you've said nothing about
>     why you were getting lots of restarts. In my experience, PBS
>     queues suspend jobs rather than deleting them, in order that
>     resources are not wasted. Apparently other places do things this
>     way. I think that this information is highly relevant to
>     explaining your observations.
>
>
>
> The point was not "why" I was getting the restarts, but the fact 
> itself that I was getting restarts close in time, as I stated in my 
> first post. I actually also don't know whether jobs are deleted or 
> suspended. I've thought that a job returned back to the queue will 
> basically start from the beginning when later moved to an empty slot 
> ... so don't understand the difference from that perspective.

It's the difference between a process being killed, and a process being 
allowed to survive but temporarily without access to the CPU. Operating 
systems routinely share the CPU over multiple execution threads. Job 
suspension just adapts that idea.

Also, different UNIX signals are treated differently by the GROMACS 
signal handler. It cannot intercept a hard kill (e.g. SIGKILL), but it 
cooperates with gentler signals by updating the checkpoint file at the 
next neighbour-search step, IIRC. Perhaps your PBS is making excessive 
use of hard kills - if it didn't, you would still get to make some 
progress even when you only get a minute of CPU time...
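
To make the "cooperative" part concrete, here is a minimal Python sketch of 
the general pattern: the handler only records the request, and the main loop 
writes a checkpoint at the next safe step. This is not GROMACS code; the 
names StopFlag and run_md and the safe_every parameter (a stand-in for the 
neighbour-search interval) are invented for illustration.

```python
import signal


class StopFlag:
    """Records that a catchable termination signal has arrived."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Do no real work in the handler: just remember the request.
        self.requested = True

    def install(self, signals=(signal.SIGTERM, signal.SIGINT)):
        # SIGKILL cannot be caught, so a hard kill never reaches handle().
        for sig in signals:
            signal.signal(sig, self.handle)


def run_md(n_steps, stop, write_checkpoint, safe_every=10):
    """Toy MD loop that stops early, with a checkpoint, once asked to.

    safe_every is a hypothetical stand-in for the neighbour-search
    interval. Returns the number of steps completed.
    """
    for step in range(1, n_steps + 1):
        # ... one MD step would be computed here ...
        at_safe_point = step % safe_every == 0
        if stop.requested and at_safe_point:
            write_checkpoint(step)
            return step
    write_checkpoint(n_steps)
    return n_steps
```

With a gentle signal the work done up to the last safe step survives; with 
a hard kill the loop simply vanishes and only the previous periodic 
checkpoint is left.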

>
>>     These many restarts suggest that the queue was full of
>>     relatively short jobs run by users with high privileges.
>>     Technically, I cannot see why the same processes should be
>>     running simultaneously because at any instant my job runs only on
>>     one node, or it stays in the queuing list.
>
>     I/O can be buffered such that the termination of the process and
>     the completion of its I/O are asynchronous. Perhaps it *shouldn't*
>     be that way, but this is a problem for the administrators of your
>     cluster to address. They know how the file system works. If the
>     next job executes before the old one has finished output, then I
>     think the symptoms you observe might be possible.
>
>
> Yes, this is true, and I believe the timing of when the buffer is 
> fully flushed is crucial in providing a possible explanation for the 
> observed behavior. However, this bottleneck has been known for a long 
> time, so I expected people had thought about that before confidently 
> putting -append as a default. That's all.

Judging by the frequency of people reporting problems, most people don't 
encounter the kind of "file system latency leading to race condition" 
problem I think that you're seeing. Some might see it, and just work 
around it, as you say. Or other people just don't have the combination of 
file system and compute resource management that you have to work with.

>
>
>     Note that there is nothing GROMACS can do about that, unless
>     somehow GROMACS can apply a lock in the first mdrun that is
>     respected by your file system such that a subsequent mdrun cannot
>     open the same file until all pending I/O has completed. I'd expect
>     proper HPC file systems do that automatically, but I don't really
>     know.
>
>
> I am not an expert nor do I know the Gromacs coding, but could one 
> have an option to specify certain timing before which Gromacs is 
> prohibited to output/write any files after its initial start, i.e. 
> some kind of suspension and/or waiting period?

One could delay some/all output initialization until the first write, 
but that would probably make the code rather messier. GROMACS does check 
that the state of the output files makes sense, by computing checksums 
and comparing them with those stored in the checkpoint file. One has to 
draw a line somewhere. If the contents of those files might be changed 
by another process, then efficient MD is simply impossible. Also, people 
would complain that they spent 15 minutes on their 1024-processor 
simulation before it died when the lack of write permission for the 
checkpoint filename got noticed. Perhaps not that exact scenario, but 
something similar could arise.
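
The checksum idea can be sketched in a few lines of Python. This is only an 
illustration of the principle, not the GROMACS implementation: the choice of 
MD5 over the whole file and the path-to-digest dictionary are assumptions 
made for the example.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_outputs(recorded):
    """Return the paths whose contents no longer match the record.

    recorded maps path -> digest saved at checkpoint time (a
    hypothetical layout chosen for this sketch).
    """
    return [path for path, digest in recorded.items()
            if file_md5(path) != digest]
```

A restart would refuse to append if changed_outputs() returned anything. 
Note that such a check only sees what has reached the file at the moment 
it runs; output still buffered by an earlier process can land afterwards.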

You can emulate this yourself by calling "sleep 10s" before mdrun to 
see whether that's long enough to solve the latency issue in your case.
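
A slightly more adaptive variant of the fixed sleep is to wait until the 
output files have stopped changing before launching mdrun. The Python 
sketch below does that by polling size and modification time; the function 
names and the default timings are invented for the example, and on a 
networked file system the values seen from one node may lag behind writes 
pending on another, so treat it as a heuristic only.

```python
import os
import subprocess
import time


def wait_until_quiet(path, quiet_seconds=10.0, poll=1.0, timeout=300.0):
    """Block until path's size and mtime are unchanged for quiet_seconds.

    Returns True if the file went quiet, False if timeout expired first.
    A missing file counts as quiet (there is nothing to append to yet).
    """
    deadline = time.monotonic() + timeout
    last_seen = None
    quiet_since = time.monotonic()
    while time.monotonic() < deadline:
        try:
            info = os.stat(path)
        except FileNotFoundError:
            return True
        seen = (info.st_size, info.st_mtime_ns)
        if seen != last_seen:
            # The file changed (or this is the first look): restart the clock.
            last_seen = seen
            quiet_since = time.monotonic()
        elif time.monotonic() - quiet_since >= quiet_seconds:
            return True
        time.sleep(poll)
    return False


def restart(command, outputs):
    """Wait for every output file to go quiet, then launch the command."""
    if all(wait_until_quiet(path) for path in outputs):
        return subprocess.run(command).returncode
    raise RuntimeError("output files are still being written")
```

A job script might then call something like 
restart(["mdrun", "-cpi", "state.cpt", "-append"], ["md.log", "traj.trr"]), 
with the file names adjusted to whatever the run actually writes.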

It seems to me that this kind of file locking ought to be the 
responsibility of the file system. Allowing a new process to access a 
file when there's buffered output pending seems wrong. It just asks for 
this kind of race condition to arise. (Assuming my theory is sound...)
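
For comparison, this is roughly what application-level locking looks like 
on a POSIX system, using an advisory flock() lock from Python. It is a 
sketch of the idea only: nothing here is taken from GROMACS, the helper 
name exclusive_lock is made up, and whether such a lock is honoured 
between nodes depends on the file system and how it is mounted.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Hold an exclusive advisory lock on path for the with-block.

    Raises BlockingIOError at once if the lock is already held through
    another open of the same file.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield fd
        finally:
            # Ask the OS to flush this file's data before giving up the lock.
            os.fsync(fd)
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A second mdrun wrapped this way would fail fast rather than write into a 
file that an earlier process still holds. The lock is advisory, so it only 
helps if every writer agrees to take it.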

> I am also wondering about the checkpoint timing - the default is 15 
> min, but what would be the minimum? Since I have not tested it, what 
> would happen if I specify 0.001 min, for example?

I/O takes time, and checkpointing requires global communication to 
prepare for it. Doing it more often than one needs to do it is wasteful. 
Your situation sounds so volatile that checkpointing every 30s is 
probably sound. On a BlueGene, about the only reason to checkpoint is a 
power outage. One size can't fit all.
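
A back-of-the-envelope model makes the trade-off visible. The Python sketch 
below assumes that a kill lands, on average, halfway between two 
checkpoints, and that a run killed before its first checkpoint loses 
everything; the function name and the 1 s checkpoint cost used in the 
example are assumptions, not measurements.

```python
def checkpoint_tradeoff(interval_s, checkpoint_cost_s, run_between_kills_s):
    """Return (overhead, lost) as fractions of the time between kills.

    overhead: share of time spent writing checkpoints.
    lost: expected share of the work thrown away by a hard kill.
    """
    overhead = checkpoint_cost_s / (interval_s + checkpoint_cost_s)
    if run_between_kills_s < interval_s:
        # Killed before the first checkpoint: nothing is saved.
        lost = 1.0
    else:
        # On average the kill lands halfway through an interval.
        lost = (interval_s / 2.0) / run_between_kills_s
    return overhead, lost
```

Under this model, a job that survives for 60 s at a time loses all of its 
work with the default 15-minute interval, but only about a quarter of it 
with a 30 s interval, at the price of roughly 3% overhead if a checkpoint 
costs 1 s.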

>
>     Words are open to interpretation. Communicating well requires that
>     you consider the impact of your words on your reader. You want
>     people who can address the problem to want to help. You don't want
>     them to feel defensive about the situation - whether you think
>     that would be an over-reaction or not.
>
>
> I got your point(s). However, I respectfully disagree with some of 
> them. First, I believe it is much more important what information 
> one's sentences bring rather than how specifically they are written.

The content is very important. Terse and informative is often much 
better than waffling vagueness. However, given a range of presentations 
with the same content, why not choose a presentation that improves the 
chance of achieving the objective?

Mark