[gmx-users] Re: restarting jobs

Thu Mar 11 21:14:01 CET 2004

Thanks everyone. The system is a dual Xeon running Linux 2.4.18-14smp. The jobs are very
stable prior to restart attempts: they ran out to about 25 ns
before I had to bring down the system. Other jobs from the same
mdp file have run to completion (30 ns), and then for another 30 ns after continuation
using the resulting .gro file and the same initial .tpr file.
For jobs that had to be killed prematurely, and which therefore did not
produce a new .gro file, the restart dies with the Large VCM error no
matter what time point I restart from (20 ns, 21 ns, 22 ns, etc.).
It may be a system issue, but I only experience the problem when I have
to restart a killed job. I don't think there is really a bad water
contact, since the simulation ran to 25 ns without one. A perfect restart
at 20 ns should not result in a bad water contact picoseconds after 20 ns
in a reproducible, deterministic system.
I'll try removing the water and see what happens, but I think the problem
lies elsewhere.
Also, in response to the NFS issue: I am working on my local filesystem.
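
For anyone who wants to scan a log for the same symptom, here is a minimal Python sketch that parses a line in the format of the Large VCM message quoted below and reports which velocity component has blown up. The regular expression is based only on that one quoted line, and the function names and the 1.0 threshold are my own assumptions, not anything taken from GROMACS.

```python
import re

# Pattern modelled on the single "Large VCM" line quoted in this thread.
# Assumption: other GROMACS versions may format the message differently.
NUMBER = r"[-+0-9.eE]+"
VCM_RE = re.compile(
    r"Large VCM\(group (?P<group>[^)]+)\):\s*"
    rf"(?P<vx>{NUMBER}),\s*(?P<vy>{NUMBER}),\s*(?P<vz>{NUMBER}),"
    rf"\s*ekin-cm:\s*(?P<ekin>{NUMBER})"
)


def parse_large_vcm(line):
    """Return the group name, VCM components and ekin-cm, or None if no match."""
    match = VCM_RE.search(line)
    if match is None:
        return None
    return {
        "group": match.group("group"),
        "vcm": tuple(float(match.group(k)) for k in ("vx", "vy", "vz")),
        "ekin_cm": float(match.group("ekin")),
    }


def blown_up_axes(vcm, threshold=1.0):
    """List the axes whose |VCM| exceeds a hypothetical threshold (arbitrary choice)."""
    return [axis for axis, value in zip("xyz", vcm) if abs(value) > threshold]


if __name__ == "__main__":
    sample = ("Large VCM(group System):      0.00017,      0.00058, "
              "22456334336.00000, ekin-cm:  1.94508e+26")
    info = parse_large_vcm(sample)
    print(info["group"], blown_up_axes(info["vcm"]))
```

Running this over every line of a log would show whether it is always the same component that diverges after a restart, which might help separate a hardware problem from a bad restart file.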

thanks again,

Marc

> Hi, Marc,
> 
>     I have the same problem when using an Itanium2 cluster. I work on 4 systems and have found that restarting a job always causes problems. The error message is the same as yours, i.e.: Large VCM(group System):      0.00017,      0.00058, 22456334336.00000, ekin-cm:  1.94508e+26
> 
>     In fact, I had finished almost 1.5 ns of simulation on the Itanium2 cluster with 8 CPUs when the job crashed due to a bad node. I then resubmitted the job; after several hundred ps of simulation it crashed, complained about Large VCM(group System), and generated stepXXXX.pdb files. I do not think the problem is with our system or protein; it comes from the machine, because I work on four systems and all of them have the same problem. To test this idea, I submitted the same job on a workstation with the same CPU (a workstation is much more stable than a Linux cluster), and my job has successfully run to 2 ns and is still running. When I submitted this job on the PC cluster, it crashed at 1.5 ns due to a node problem; I then resubmitted it, and it crashed at 1.6 ns. So I think the error has nothing to do with your system, but is related to the machine you use. Can you tell me what machine you use? Is it a Linux cluster? If so, that would show that the job crash is due to a machine problem. No matter how you optimize your sys
> 
>      Does anybody have the same experience? How can the restarting problem be solved on a Linux or Itanium cluster? Any hints and discussion will be appreciated!
>        
> Cheers,
> 
> Linda