[gmx-users] Re: restarting jobs

Martina Bertsch, PhD mbe404 at lulu.it.northwestern.edu
Thu Mar 11 20:33:00 CET 2004

If your jobs are generally unstable on the cluster, you may be
experiencing input/output problems that can eventually crash a job.
If you have an NFS-mounted file system, it is the most likely source
of the I/O errors. Ask your system administrator to unmount the NFS
and use direct local disk I/O instead.
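Before asking the administrator, you can check whether your run
directory actually sits on NFS. Below is a minimal sketch for Linux,
assuming the kernel's /proc/mounts table (whitespace-separated fields:
device, mount point, filesystem type, options, ...). The helper names
filesystem_type and is_nfs are hypothetical. The sketch picks the
longest mount point that contains the path, which is the mount that
actually serves it. On most Linux systems `df -T .` gives the same
answer from the shell.

```python
import os


def filesystem_type(path, mounts_text=None):
    """Return the filesystem type (e.g. 'nfs', 'ext3') of the mount serving path.

    Assumes a Linux /proc/mounts layout: whitespace-separated fields
    device, mount point, fstype, options, dump, pass.
    Symlinks in path are not resolved; pass os.path.realpath(path) if needed.
    Mount points containing escaped spaces (\\040) are not decoded.
    """
    if mounts_text is None:
        with open("/proc/mounts") as f:
            mounts_text = f.read()
    target = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fstype = fields[1], fields[2]
        # The mount serves target if target is the mount point itself
        # or lies below it ('/' matches every absolute path).
        prefix = mount_point.rstrip("/") + "/"
        if target == mount_point or target.startswith(prefix):
            # Longest mount point wins; '>=' lets a later entry for the same
            # mount point shadow an earlier one.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fstype
    return best_type


def is_nfs(path, mounts_text=None):
    """True if path lives on an NFS mount (fstype 'nfs' or 'nfs4')."""
    fstype = filesystem_type(path, mounts_text)
    return fstype is not None and fstype.startswith("nfs")
```

For example, is_nfs(os.getcwd()) run from the simulation directory on a
compute node tells you whether trajectory and checkpoint output is going
over the network.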

By the way, if my position-restrained MD still crashes after conjugate
gradient energy minimization (EM), giving error messages such as:

"Large VCM (System)..."

I run a short position-restrained (PR) equilibration stage without
pressure scaling (coupling), i.e., I modify the .mdp file:

pcoupl = no
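If you want to script this change, for example to derive the short PR
run's parameter file from an existing one, the following is a minimal
sketch that sets the pressure-coupling option (`pcoupl` in .mdp files)
to `no`. The helper name set_mdp_option is hypothetical; it assumes the
usual .mdp layout of one `key = value` per line with `;` starting a
comment, and it compares option names case-insensitively (also an
assumption of this sketch).

```python
def set_mdp_option(mdp_text, key, value):
    """Return a copy of mdp_text in which option `key` is set to `value`.

    An existing setting is replaced in place (its trailing comment is dropped),
    later duplicates are removed, and the option is appended if it was absent.
    Option names are compared case-insensitively (an assumption of this sketch).
    """
    out = []
    found = False
    for line in mdp_text.splitlines():
        body = line.split(";", 1)[0]  # ignore anything after a ';' comment
        if "=" in body:
            name = body.split("=", 1)[0].strip()
            if name.lower() == key.lower():
                if not found:
                    out.append(f"{key} = {value}")
                    found = True
                continue  # skip the old line (and any later duplicates)
        out.append(line)
    if not found:
        out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"
```

Usage, with hypothetical file names: read pr.mdp, pass its text through
set_mdp_option(text, "pcoupl", "no"), and write the result to
pr_nopc.mdp. Writing a new file rather than editing in place keeps the
original .mdp for the later, pressure-coupled run.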

Martina Bertsch, Ph.D.

Linda wrote:

>Hi, Marc,
>    I have the same problem on an Itanium2 cluster. I work on 4 systems and found that they always have problems when restarting a job. The error message is the same as yours, i.e.: Large VCM(group System):      0.00017,      0.00058, 22456334336.00000, ekin-cm:  1.94508e+26
>    In fact, I had finished almost 1.5 ns of simulation on an Itanium2 cluster with 8 CPUs when the job crashed due to a bad node. Then I resubmitted the job; after several hundred ps of simulation it crashed, complained about Large VCM(group System), and generated stepXXXX.pdb files. I do not think the problem lies in our system or protein; it comes from the machine, because all four of my systems have the same problem. To test this idea, I submitted the same job on a workstation with the same CPU (a workstation is much more stable than a Linux cluster), and my job successfully ran to 2 ns and is still running. On the PC cluster, by contrast, this job crashed at 1.5 ns due to a node problem, and after I resubmitted it, it crashed at 1.6 ns. So I think the error has nothing to do with your system but is related to the machine you use. Can you tell me what machine you use? Is it a Linux cluster? If so, that demonstrates that the crash is due to a machine problem. No matter how you optimize your sys
>     Does anybody have the same experience? How can the restart problem be solved on a Linux cluster or an Itanium cluster? Any hints and discussion will be appreciated!

More information about the gromacs.org_gmx-users mailing list