[gmx-users] Re: restarting jobs

Thu Mar 11 16:23:01 CET 2004

Hi, Marc,

    I have the same problem on an Itanium2 cluster. I work on 4 systems, and every one of them runs into trouble when a job is restarted. The error message is the same as yours, i.e.: Large VCM(group System):      0.00017,      0.00058, 22456334336.00000, ekin-cm:  1.94508e+26
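
    For anyone who wants to check their own log files for the same message, here is a minimal sketch in Python that pulls the numbers out of such a line. The pattern is only my guess based on the single line quoted above, so the real log format may differ between versions, and the function name parse_large_vcm is made up for this example.

```python
import re

# Pattern assumed from the single example line quoted above; it is not taken
# from the GROMACS source, so other versions may print the message differently.
_NUM = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
LARGE_VCM = re.compile(
    r"Large VCM\(group (?P<group>[^)]+)\):\s*"
    rf"(?P<x>{_NUM}),\s*(?P<y>{_NUM}),\s*(?P<z>{_NUM}),\s*"
    rf"ekin-cm:\s*(?P<ekin>{_NUM})"
)


def parse_large_vcm(line):
    """Return (group, (vx, vy, vz), ekin_cm), or None if the line does not match."""
    m = LARGE_VCM.search(line)
    if m is None:
        return None
    vcm = (float(m["x"]), float(m["y"]), float(m["z"]))
    return m["group"], vcm, float(m["ekin"])


# The error line from this message, used as the only test input.
line = ("Large VCM(group System):      0.00017,      0.00058, "
        "22456334336.00000, ekin-cm:  1.94508e+26")
group, vcm, ekin = parse_large_vcm(line)
print(group, vcm, ekin)
```

    Running this over each line of a log would show at which point the centre-of-mass velocity first becomes unreasonable; in the line above only the third component has blown up.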

    In fact, I had finished almost 1.5 ns of simulation on the Itanium2 cluster with 8 CPUs when the job crashed because of a bad node. I then resubmitted the job, and after several hundred ps of simulation it crashed again, complaining about Large VCM(group System) and generating stepXXXX.pdb files. I do not think the problem lies in our system or protein; I think it comes from the machine, because all four of the systems I work on show the same problem. To test this idea, I submitted the same job on a workstation with the same CPU (a workstation is much more stable than a Linux cluster), and there the job has run successfully to 2 ns and is still running. By contrast, when I submitted this job on the PC cluster, it crashed at 1.5 ns because of the node problem, and after I resubmitted it, it crashed at 1.6 ns. So I think the error has nothing to do with your system, but is related to the machine you use. Can you tell me what machine you use? Is it a Linux cluster? If so, that would support the idea that the crash is due to a machine problem. No matter how you optimize your sys

     Has anybody else had the same experience? How can the restart problem on a Linux cluster or an Itanium cluster be solved? Any hints and discussion will be appreciated!

Cheers,

Linda