[gmx-users] Re: mdrun crashed on upgraded dual 1.0 GHz Mac OS X

taeho.kim at utoronto.ca
Fri May 23 01:05:02 CEST 2003


Hi 

We upgraded two old single-CPU machines (733 MHz and 550 MHz) to the same dual 1.0 GHz CPUs. The upgrade itself looked easy, and the performance gain is impressive. However, one of the machines (the former 550 MHz box, on which MPI, FFTW and GROMACS were reinstalled after the upgrade) does not seem to like its new, faster brain: it crashes often, roughly once a day. (The former 733 MHz box, on which the three packages were not reinstalled after the CPU upgrade because it already had an MPI-enabled GROMACS, has not shown this problem.)

Here are the log files: (1) the crash report from the "About This Mac" menu, which I do not get every time, and (2) the mdrun log written with "-g". Both machines run GROMACS 3.1.5_pre1, FFTW 2.1.5 and LAM/MPI 6.5.9; the system is a protein in water with PME, about 30,000 atoms in a cubic box. The job was run on a single machine, roughly as sketched below.
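For reference, this is roughly how the run is launched (a minimal sketch only: the file names are generic placeholders, and the commands are the standard GROMACS 3.x and LAM/MPI 6.5 ones, so they may differ from the exact setup here):

# boot LAM on this machine only, telling it both CPUs are available
echo "localhost cpu=2" > lamhosts
lamboot -v lamhosts

# preprocess for a 2-node run (GROMACS 3.x also needs -np at grompp time)
grompp -np 2 -f md.mdp -c conf.gro -p topol.top -o topol.tpr

# start two mdrun processes; -g names the log file quoted below
mpirun -np 2 mdrun -np 2 -s topol.tpr -g md.log -v

# shut the LAM daemons down when the run is done
lamhalt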


I now have this information from the Mac system about the mdrun crash. Is it useful, or even possible, to find out what caused the problem, and what should I do to prevent it in the future? There are many posts in the list archives about the "caught a SIGSEGV" error; it seems to have something to do with PME or the simulation box.

Thank you,

Taeho
-------------------------
1. Crash log from "About This Mac" (Apple logo in the main menu), file mdrun.crash.log:
2003-05-20 13:11:24 -0400:
Date/Time: 2003-05-20 13:11:24 -0400
OS Version: 10.2.6 (Build 6L60)
Host: Y-G4-733.local.
Command: mdrun
PID: 10948
Exception: EXC_BAD_ACCESS (0x0001)
Codes: KERN_INVALID_ADDRESS (0x0001) at 0x3dd6b3a0
Thread 0 Crashed:
#0 0x0026f128 in lam_shfree
#1 0x00270dac in _shm_fastrecv
#2 0x0025e3c4 in _rpi_c2c_fastrecv
#3 0x00254ac8 in MPI_Recv
#4 0x00250980 in bcast_lin
#5 0x0025088c in MPI_Bcast
#6 0x0004adec in do_pme
#7 0x000422e8 in force
#8 0x0001e9c8 in do_force
#9 0x00015a40 in do_md
#10 0x00017700 in mdrunner
#11 0x00017d04 in main
#12 0x00002860 in _start (crt.c:267)
#13 0x000026e0 in start
PPC Thread State:
srr0: 0x0026f128 srr1: 0x0200f930 vrsave: 0x00000000
xer: 0x20000000 lr: 0x0026f0d8 ctr: 0x900482b8 mq: 0x00000000
r0: 0x43da0f22 r1: 0xbfffe700 r2: 0x00000000 r3: 0x00526800
r4: 0x00000000 r5: 0x00000001 r6: 0x00000010 r7: 0x0035c180
r8: 0x002a997c r9: 0x0035c180 r10: 0x3dd6b3a0 r11: 0x3da0f220
r12: 0x900482b8 r13: 0x00000000 r14: 0x00012750 r15: 0xbffff0f0
r16: 0x00000019 r17: 0xbffff150 r18: 0x00000000 r19: 0xbfffea18
r20: 0x002a9e80 r21: 0xbfffea88 r22: 0x002a093c r23: 0x00024640
r24: 0x00314814 r25: 0x00526810 r26: 0x00999640 r27: 0x003580e0
r28: 0x00000000 r29: 0x003580c0 r30: 0x00526810 r31: 0x0026f0c8



2. mdrun log file written with the "-g" option:
MPI process rank 1 (n0, p10948) caught a SIGSEGV in MPI_Recv.
Rank (1, MPI_COMM_WORLD): Call stack within LAM:   
Rank (1, MPI_COMM_WORLD):  - MPI_Recv()
Rank (1, MPI_COMM_WORLD):  - MPI_Bcast()
Rank (1, MPI_COMM_WORLD):  - main()
-----------------------------------------------------------------------------
 
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 10948 failed on node n0 with exit status 1.   
-----------------------------------------------------------------------------
Rank (1, MPI_COMM_WORLD): Call stack within LAM:   
Rank (1, MPI_COMM_WORLD):  - MPI_Recv()
Rank (1, MPI_COMM_WORLD):  - MPI_Bcast()
Rank (1, MPI_COMM_WORLD):  - main()


