[gmx-users] Two machines, same job, one fails
Mark Abraham
mark.abraham at anu.edu.au
Wed Jan 26 00:54:32 CET 2011
On 01/26/11, TJ Mustard <mustardt at onid.orst.edu> wrote:
>
> On January 25, 2011 at 3:24 PM "Justin A. Lemkul" <jalemkul at vt.edu> wrote:
>
> > TJ Mustard wrote:
> > >
> > > On January 25, 2011 at 2:08 PM Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
> > >
> > >> On 26/01/2011 5:50 AM, TJ Mustard wrote:
> > >>>
> > >>> Hi all,
> > >>>
> > >>> I am running MD/FEP on a protein-ligand system with GROMACS 4.5.3
> > >>> and FFTW 3.2.2.
> > >>>
> > >>> My iMac will run the job (over 4000 steps, until I killed it) at
> > >>> 4 fs steps. (I am using heavy H.)
> > >>>
> > >>> Once I put this on our group's AMD cluster, the jobs fail even with
> > >>> 2 fs steps (with thousands of LINCS errors).
> > >>>
> > >>> We have recompiled the cluster's GROMACS 4.5.3 build, with no
> > >>> change. I know the system is the same, since I copied the job from
> > >>> the server to my machine to rerun it.
> > >>>
> > >>> What is going on? Why can one machine run a job perfectly while the
> > >>> other cannot? I also know there is adequate memory on both machines.
> > >>
> > >> You've posted this before, and I made a number of diagnostic
> > >> suggestions. What did you learn?
> > >>
> > >> Mark
> > >
> > > Mark and all,
> > >
> > > First, thank you for all your help. What you suggested last time
> > > helped considerably with our jobs/calculations. I have learned that
> > > using the standard .mdp settings allows my heavy-H 4 fs jobs to run
> > > on my iMac (Intel), and I have made these my new standard for future
> > > jobs. We chose the smaller 0.8 nm PME/cutoff because of other
> > > papers/tutorials, but now we understand why we need the standard
> > > settings. What I now see as our problem is that our machines have
> > > some sort of variable we cannot account for. If I am blind to my
> > > error, please show me. I just don't understand why one computer works
> > > while the other does not. We have recompiled GROMACS 4.5.3 in single
> > > precision on our cluster, and still have this problem.
> >
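As an aside, it is worth verifying that the run input really is bit-for-bit
identical on both machines rather than assuming it. A minimal sketch (file
names and paths are only illustrative):

  # Compare the run input used on the iMac against the copy on the cluster;
  # gmxcheck reports any parameters, topology entries or coordinates that
  # differ between the two.
  gmxcheck -s1 imac_run/topol.tpr -s2 cluster_run/topol.tpr

  # A plain checksum is an even blunter sanity check (the command is "md5"
  # on the Mac):
  md5sum imac_run/topol.tpr cluster_run/topol.tpr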
> > I know the feeling all too well. PowerPC jobs crash instantly on our
> > cluster, despite working beautifully on our lab machines. There's a bug
> > report about that one, but I haven't heard anything about AMD failures.
> > It remains a possibility that something beyond your control is going
> > on. To explore a bit further:
> >
> > 1. Do the systems in question crash immediately (i.e., step zero) or do
> > they run for some time?
> >
> Step 0, every time.
>
> > 2. If they give you even a little bit of output, you can analyze which
> > energy terms, etc. go haywire with the tips listed here:
>
> All I have seen on these is LINCS errors and water molecules that could
> not be settled.
>
> But I will check this out right now, and email if I smell trouble.
>
> > http://www.gromacs.org/Documentation/Terminology/Blowing_Up#Diagnosing_an_Unstable_System
> >
> > That would help in tracking down any potential bug or error.
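If the crashing runs manage to write anything before dying, the partial
energy file is the first thing to look at. A minimal sketch (file names are
illustrative):

  # Extract the energy terms recorded before the crash from the partial
  # energy file; at the interactive prompt pick Potential plus the
  # individual bonded and nonbonded terms, then look for huge or NaN
  # values already at step 0.
  g_energy -f md.edr -o energy_check.xvg

  # And a quick sanity check of whatever trajectory frames were written:
  gmxcheck -f md.trr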
>
> > 3. Is it just the production runs that are crashing, or everything? If
> > EM isn't even working, that smells even buggier.
>
> Awesome question; we have seen some weird stuff. Sometimes the cluster
> will give us segmentation faults, and then the job will fail on our
> machines too, or sometimes not on our iMacs. I know, weird! If EM starts
> on the cluster, it will finish. Where we have issues is in position
> restraints (PR), MD, and MD/FEP. It doesn't matter whether FEP is on or
> off in an MD run (although we are using SD for these MD/FEP runs).
>
Good. That rules out FEP as the source of the problem, as I asked in your
previous thread.
>
> > 4. Are the compilers the same on the iMac vs. AMD cluster?
>
> No, I am using GCC 4.4.4 (x86_64-apple-darwin10), and the cluster is
> using GCC 4.1.2 (x86_64-redhat-linux).
>
> I just did a quick yum search and there doesn't seem to be a newer GCC.
> We know you are moving to CMake, but we have yet to get it working on
> our cluster successfully.
>
There have been doubts about the 4.1.x series of GCC compilers for
GROMACS - and IIRC 4.1.2 in particular (do search the archives yourself).
Some time back, Berk solicited actual accounts of problems and nobody
presented one, so we no longer have an official warning against using it.
However, I'd say this is a candidate for the source of your problems. I
would ask your cluster admins to get and compile a source-code version of
GCC for you to try.
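For what it is worth, here is a rough sketch of what trying an alternative
compiler could look like with the autoconf build you are already using (the
GCC install path below is only a placeholder for wherever your admins put it):

  # See what the cluster build currently picks up:
  gcc --version

  # Rebuild GROMACS 4.5.3 against a separately installed, newer GCC:
  export CC=/opt/gcc-4.4/bin/gcc
  ./configure --prefix=$HOME/gromacs-4.5.3-gcc44
  make -j 4 && make install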
Mark