[gmx-users] Athlon cluster experience

Jay Mashl mashl at uiuc.edu
Wed Feb 19 14:06:49 CET 2003

On Tue, 18 Feb 2003, Erik Lindahl wrote:

> On Tuesday, Feb 18, 2003, at 22:18 US/Pacific, Lynne E. Bilston wrote:
> > Justin,
> >
> > My dual Athlon cluster is about 10 months old (dual 1800MP processors
> > on a tyan MB). lm_sensors gives a temperature of about 49-55 degrees C
> > when running both processors on a job. Idling is about 42-45C. Yours
> > do seem a bit hot by those standards.
> >
> > I did initially have some problems with jobs quitting due to
> > overheating. It turned out our AC system was being switched off at
> > night. How warm is the room your cluster is in?
> >
> > Let me know if you want more info on my lm_sensors setup or output.
> >
> > -Lynne
> >
> Hi,
> A couple of months ago I created a small CPU burn-in (i.e. heater :-)
> program - it should be available on the contributions page at
> www.gromacs.org.
> Just for fun, I actually started writing a really tight assembly loop
> with SSE instructions, but when I installed LM-sensors according to
> Lynne's instructions I surprisingly found out that the first version
> ran colder than a normal Gromacs simulation (although it was hotter
> than any other burn-in program on the net.)
> I'm pretty sure this is because the Gromacs innerloops use both the SSE
> and integer parts of the CPU (and the cache & memory), so I simply
> wrote a new version with a very small program that calls one of the
> Gromacs innerloops, tweaking the neighborlists to make it as hot as
> possible.
> It probably runs 1-2 degrees hotter than normal Gromacs, but the main
> difference is that the results are compared with a "vanilla" C loop,
> and if there are any random changes during the run I print an error
> message.
> I didn't find any errors when I ran this for a week on a dozen of our
> nodes, but I've heard rumors that some versions of Athlon MP have
> problems with SMP synchronization. I have NO idea whether this is true,
> but it might be worth testing:
> 1. The burn-in program.
> 2. Consistency of SMP vs. non-SMP runs.
> 3. Different versions of LAM, and check if there really are any
> reported problems...
> Cheers,
> Erik

I see behavior similar to what Justin sees, even on a standalone workstation.
The BIOS reports a similar idle temperature.  What happens is that the network
suddenly disappears, and the video signal is present but shows no image.

I have done a little testing with another number crunching program as well;
there, using two threads always eventually locks up the machine,
whereas using one thread doesn't crash the machine.

I've recently done some Gromacs runs of a certain system with dual threads.
When I let it run overnight in a warm office, the machine has always locked up,
though at a different time after the start of each run.  However, when I am
using the console, it doesn't crash.

My thinking had been that the problem was mostly heat related, perhaps
triggered by software, but I believe interactive use introduces enough
idle/delay cycles to let the CPUs "cool" just enough to avoid a crash.

In addition to Erik's enumeration, I would like to add that any systematic
testing could look at

  4.  SMP 1.1 vs. 1.4 mode in the BIOS
  5.  BIOS APIC support
  6.  Athlon optimized vs. i386 LAM
  7.  kernel pre-emption patch
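
For anyone who wants to script such tests, here is a minimal sketch of the
"compare against a reference and complain on any change" idea Erik describes.
It is not his burn-in program: the kernel below is a made-up floating-point
loop rather than a Gromacs innerloop, the reference is simply the first run
of the same kernel rather than a separate "vanilla" C loop, and the names
kernel, bits and burn_check are my own. It will also not heat a CPU the way
tuned SSE code does; it only illustrates the checking logic.

```python
import math
import struct


def kernel(n_pairs, seed=1.0):
    """Deterministic floating-point workload (a stand-in, NOT a Gromacs innerloop)."""
    total = 0.0
    x = seed
    for i in range(1, n_pairs + 1):
        r2 = x * x + float(i)                 # always >= 1, so no division by zero
        total += 1.0 / math.sqrt(r2) - 1.0 / (r2 * r2 * r2)
        x = (x * 1.000001) % 10.0 + 0.5       # keep x bounded and reproducible
    return total


def bits(value):
    """Exact 64-bit pattern of a float, so the comparison is bit-for-bit."""
    return struct.pack("<d", value)


def burn_check(repeats, n_pairs=20000, compute=kernel):
    """Run the workload `repeats` times; return the indices of runs whose
    result differs from the reference computed once at the start."""
    reference = bits(compute(n_pairs))
    mismatches = []
    for run in range(repeats):
        if bits(compute(n_pairs)) != reference:
            mismatches.append(run)
    return mismatches


if __name__ == "__main__":
    bad = burn_check(50)
    if bad:
        print("ERROR: results changed in runs", bad)
    else:
        print("no mismatches in 50 runs")
```

Starting one copy of the script and then two copies at once (one per CPU)
gives a crude one-CPU vs. two-CPU comparison under each of the BIOS and
kernel settings listed above. Since the same input should always give the
same bits on healthy hardware, any mismatch is worth investigating.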

