[gmx-users] Re: restarting jobs
Marc Vogt
mvogt at es.chem.umass.edu
Thu Mar 11 22:30:01 CET 2004
Just realized that I upgraded from GROMACS 3.1.4 to 3.2. I started
the run in GROMACS 3.1.4 and upgraded to 3.2 before restarting the jobs.
Could this be the source of the problem?
Marc
>
> Thanks everyone. The system is a dual Xeon running Linux 2.4.18-14smp. The jobs are very
> stable prior to restart attempts. They ran out to about 25 ns
> before I had to bring down the system. Other jobs from the same
> mdp file have run to completion (30 ns), and then another 30 ns after continuation
> using the resulting .gro file and the same initial .tpr file.
> For jobs that had to be killed prematurely, and therefore did not produce
> a new .gro file, no matter what time point I restart from (20 ns, 21 ns,
> 22 ns, etc.), the restart dies with the Large VCM error. It may be a
> system issue, but I only experience the problem when having to restart
> a killed job. I don't think
> there is really a bad water contact since the simulation ran to 25 ns
> without one. A perfect restart at 20 ns should not result in a bad water
> contact picoseconds after 20 ns in a reproducible, deterministic system.
> I'll try removing the water and see what happens, but I think the problem
> lies elsewhere.
> Also, in response to the NFS issue, I am working locally on my filesystem.
>
> thanks again,
>
> Marc
>
>
> > Hi, Marc,
> >
> > I have the same problem when using an Itanium2 cluster. I work on 4 systems, and I find there is always a problem when restarting a job. The error info is the same as yours, i.e.: Large VCM(group System): 0.00017, 0.00058, 22456334336.00000, ekin-cm: 1.94508e+26
> >
> > In fact, I had finished almost 1.5 ns of simulation on the Itanium2 cluster with 8 CPUs when the job crashed because of a bad node. I resubmitted the job, and after several hundred ps of simulation it crashed, complaining about Large VCM(group System) and generating stepXXXX.pdb files. I do not think the problem is with our system or protein; I think it comes from the machine, because I work on four systems and all of them have the same problem. To test this idea, I submitted the same job on a workstation with the same CPU (a workstation is much more stable than a Linux cluster), and that job has run successfully to 2 ns and is still running. When I submitted this job on the PC cluster, it crashed at 1.5 ns because of the node problem; when I resubmitted it, it crashed at 1.6 ns. So I think the error has nothing to do with your system but is related to the machine you used. Can you tell us what machine you use? Is it a Linux cluster? If so, that demonstrates the job crashing is due to a machine problem. No matter how you optimize your sys
> >
> > Does anybody have the same experience? How can the restarting problem on a Linux cluster or Itanium cluster be solved? Any hints and discussion will be appreciated!
> >
> > Cheers,
> >
> > Linda
> >
> >
> >
> > _______________________________________________
> > gmx-users mailing list
> > gmx-users at gromacs.org
> > http://www.gromacs.org/mailman/listinfo/gmx-users
> > Please don't post (un)subscribe requests to the list. Use the
> > www interface or send it to gmx-users-request at gromacs.org.
> >
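
The Large VCM message quoted in this thread has a regular shape, so a log can be scanned for it to see which velocity component has blown up after a restart. What follows is only a minimal Python sketch, assuming the message always has the form quoted above; `parse_large_vcm` and `worst_axis` are hypothetical helper names and are not part of GROMACS.

```python
import re

# Matches the message as quoted in this thread, e.g.
# "Large VCM(group System): 0.00017, 0.00058, 22456334336.00000, ekin-cm: 1.94508e+26"
# Assumption: the three numbers are the x, y, z components of the
# centre-of-mass velocity for the named group.
NUMBER = r"[-+0-9.eE]+"
VCM_RE = re.compile(
    r"Large VCM\(group\s+(?P<group>[^)]+)\):\s*"
    rf"(?P<x>{NUMBER}),\s*(?P<y>{NUMBER}),\s*(?P<z>{NUMBER}),\s*"
    rf"ekin-cm:\s*(?P<ekin>{NUMBER})"
)


def parse_large_vcm(line):
    """Return the group, VCM components and ekin-cm from one log line, or None."""
    match = VCM_RE.search(line)
    if match is None:
        return None
    vcm = tuple(float(match.group(axis)) for axis in ("x", "y", "z"))
    return {
        "group": match.group("group").strip(),
        "vcm": vcm,
        "ekin_cm": float(match.group("ekin")),
    }


def worst_axis(vcm):
    """Name the axis ('x', 'y' or 'z') with the largest absolute component."""
    return max(zip("xyz", vcm), key=lambda pair: abs(pair[1]))[0]


if __name__ == "__main__":
    quoted = ("Large VCM(group System): 0.00017, 0.00058, "
              "22456334336.00000, ekin-cm: 1.94508e+26")
    info = parse_large_vcm(quoted)
    print(info["group"], worst_axis(info["vcm"]), info["ekin_cm"])
```

For the line Linda quotes, `worst_axis` picks out the z component, which is the only one of the three that is not close to zero.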