[gmx-users] Athlon cluster experience

Oliver Beckstein oliver at biop.ox.ac.uk
Wed Feb 19 15:41:41 CET 2003


On Athlon instabilities:

> On Tue, 18 Feb 2003, Erik Lindahl wrote:

[...]

> >
> > I didn't find any errors when I ran this for a week on a dozen of our
> > nodes, but I've heard rumors that some versions of Athlon MP have
> > problems with SMP synchronization. I have NO idea whether this is true,
> > but it might be worth testing
> >
> > 1. The burn-in program.
> > 2. Consistency of SMP vs. non-SMP runs.
> > 3. Different versions of LAM, and check if there really are any
> > reported problems...
> >

On Wed, 19 Feb 2003, Jay Mashl wrote:

> I see behavior similar to what Justin sees, even on a standalone
> workstation.  The BIOS reports a similar idle temperature.  What
> happens is that the network suddenly disappears, and the video signal
> is present but shows no image.  I have done a little testing with
> another number crunching program as well; there, the use of two
> threads always will eventually lock up the machine, whereas using
> one thread doesn't crash the machine.  I've recently done some
> gromacs runs of a certain system with dual threads.  When I let it
> run overnight in a warm office, the machine has always locked up but
> at all different times after the start of the run.  However when I
> am using the console, it doesn't crash.

> My thinking had been that the problem was mostly heat related,
> perhaps triggered by software, but I believe interactive use
> introduces enough idle/delay cycles so as to allow the cpus to
> "cool" just enough so as to avoid a crash.

> 
> In addition to Erik's enumeration, I would like to add that any systematic
> testing could look at
> 
>   4.  SMP 1.1 vs. 1.4 mode in the BIOS
>   5.  BIOS APIC support
>   6.  Athlon optimized vs. i386 LAM
>   7.  kernel pre-emption patch
> 

We have reproducible SMP failures on standalone Athlon MP 1900+ machines
with Tyan Tiger MP S2460 boards. Typically, single-processor gromacs runs
are rock-stable (as are perpetual, multiple kernel compiles and Erik's
burn-in programme). However, 2-processor jobs lock up the machine (as
described by Jay) after a couple of minutes or hours (and then it is
pull-the-plug time), and so does at least one other SMP application
(which does not use LAM).

Someone suggested setting the MP version in the BIOS from 1.4
to 1.1, and others mentioned memory, but the problem remains unsolved
and I have no idea how to pinpoint the error.  Hence, all pointers
(especially along the lines of the rumours Erik mentions and the
things Jay suggests) are very much appreciated.
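
For item 2 on Erik's list (consistency of SMP vs. non-SMP runs), one
possible approach is to compare the energies of the same short run done
once on one processor and once on two. The following is only a rough
sketch under stated assumptions: both energy series have been exported
to two-column text files (time, value) in the .xvg style, where header
lines start with '#' or '@'; the file names, the function names and the
tolerance are hypothetical. Since summation order differs between the
two modes, the trajectories can be expected to drift apart eventually,
so only the first few frames are likely to be meaningful.

```python
import math
import sys


def read_xvg(lines):
    """Parse (time, value) pairs, skipping blank, '#' and '@' lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line[0] in "#@":
            continue
        fields = line.split()
        rows.append((float(fields[0]), float(fields[1])))
    return rows


def first_divergence(serial, smp, rel_tol=1e-5):
    """Return the time of the first frame whose values differ by more
    than rel_tol (an assumed, arbitrary tolerance), or None if the
    overlapping frames all agree."""
    for (t1, e1), (t2, e2) in zip(serial, smp):
        if t1 != t2:
            raise ValueError(f"time mismatch: {t1} vs {t2}")
        if not math.isclose(e1, e2, rel_tol=rel_tol):
            return t1
    return None


# Hypothetical usage: python compare.py serial.xvg smp.xvg
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        t = first_divergence(read_xvg(f1), read_xvg(f2))
    if t is None:
        print("no divergence within tolerance")
    else:
        print(f"first divergence at t = {t}")
```

A divergence in the very first frames would hint at a real
inconsistency, whereas a late one is probably just round-off.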

We've got a minicluster of Athlons on Tyan Tiger MPX S2466N-4M boards
which run fine, though.

For the record:
OS RedHat Linux 8.0
vanilla kernel  2.4.18-19.8.0smp
gromacs 3.1.4, recent lam

Oliver

-- 
Oliver Beckstein * oliver at bioch.ox.ac.uk
 http://indigo1.biop.ox.ac.uk/oliver/



