[gmx-users] intermittent crashes running on multiple nodes

Ron Hills ronhills at gmail.com
Wed Jan 7 22:19:01 CET 2009


Dear GMX Users,
I have been running gromacs-4.0.2 and gromacs-4.0_rc3 in parallel on various
8-core-per-node and 16-core-per-node 64-bit Linux clusters. While I can run
MPI jobs without any problems on a single node (8 or 16 processes,
respectively), larger jobs spanning more than one node invariably crash,
either immediately or after several hours of correct simulation output. The
errors and the compile options/details are given below. The nodes seem to
stop communicating and stop producing output after some time, even when I
launch with mpirun -q 0. These were mostly small simulations (a 40 angstrom
cubic box containing peptide(s) and water). Thanks, Ron Hills
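For reference, a typical multi-node launch looks roughly like the sketch
below. This is illustrative only: the hostfile option (-m here) and its
argument differ between the InfiniPath and MVAPICH mpirun wrappers, and the
mdrun options shown are just ordinary GROMACS 4.0 flags, not necessarily the
exact command line of the failing jobs.

# 8 nodes x 8 cores = 64 MPI processes (hostfile flag varies by MPI stack)
mpirun -np 64 -m ./mpihosts mdrun_mpi -deffnm md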

setenv CC icc
setenv CXX icc
setenv F77 ifort #intel ifort 10.1.021 or 10.1.017
setenv MPICC "mpicc -cc=icc"  #using Pathscale/Qlogic "InfiniPath"
InfiniBand MPI or mvapich2-1.2-intel-ofed-1.2.5.5 or mvapich/1.0
setenv MPIF77 "mpif77 -fc=ifort"
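In case it matters, the build follows (approximately) the standard GROMACS
4.0 autoconf two-pass recipe sketched below; the install prefix is only an
example, and the _mpi program suffix matches the binary name seen in the
backtraces.

./configure --prefix=$HOME/gromacs-4.0.2              # serial tools first
make && make install
make distclean
./configure --prefix=$HOME/gromacs-4.0.2 --enable-mpi --program-suffix=_mpi
make mdrun && make install-mdrun                      # MPI mdrun only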

> ***immediate error after job submission:
> tr029:36.Hardware problem: {[RXE EAGERTID Memory Parity]}
> tr024:14.ips_proto_connect: Couldn't connect to
> 172.17.19.29(LID=0x0025:2.0). Time elapased 00:00:30. Still trying...
> tr025:16.MPID_Key_Init: rank  16 (tr025): Detected Connection timeout:
> 172.17.19.29 (rank 32,33,34,35,36,37,38,39)

> ***termination after 30hrs running correctly on 8x8=64 cores:
> tr019:33.PIO Send Stall after at least 2.10M failed send attempts
> (elapsed=54232.018s, last=2119242.641s, pio_stall_count=1)
> (TxPktCnt=21654082263,RxPktCnt=21662363713) PIO Send Bufs port 1 with 8 bufs
> from 8 to 15. PIO avail regs:  <0>=(4145041114514105)  <1>=(1010545410441100)
> <2>=(15555554)  <3>=(0)  <4>=(0)  <5>=(0)  <6>=(0)  <7>=(0) . PIO shadow
> regs:  <0>=(41505001ebae4050)  (err=23)
> mdrun_mpi:14064 terminated with signal 11 at PC=61329f SP=7fbfffcd00.
> Backtrace:
> /uufs/hec.utah.edu/common/vothfs/u0636784/gromacs-4.0_rc3/tlrd/bin/mdrun_mpi[0x61329f]
> MPIRUN.tr012: 26 ranks have not yet exited 60 seconds after rank 37 (node
> tr019) exited without reaching MPI_Finalize().
> MPIRUN.tr012: Waiting at most another 60 seconds for the remaining ranks to
> do a clean shutdown before terminating 26 node processes

> ***error from a coworker:
> mdrun_mpi:27203 terminated with signal 11 at PC=469daa SP=7fbfffe080.
>  Backtrace:
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(do_pme+0x2f8e)[0x469daa]
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(force+0x6be)[0x443d4e]
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(do_force+0xb7b)[0x47d3f1]
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(do_md+0x19c4)[0x42b360]
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(mdrunner+0xc15)[0x4297b5]
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(main+0x2ad)[0x42ccd1]
> /lib64/tls/libc.so.6(__libc_start_main+0xdb)[0x2a96a5e40b]
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi[0x41781a]
> MPIRUN.tr082: 15 ranks have not yet exited 60 seconds after rank 12 (node
> tr086) exited without reaching MPI_Finalize().
> MPIRUN.tr082: Waiting at most another 60 seconds for the remaining ranks to
> do a clean shutdown before terminating 15 node processes

***Using mpirun -q 0, I get the following errors after 460,000 dynamics
steps had completed without incident:
tr006:6.PIO Send Stall after at least 2.10M failed send attempts
(elapsed=272.699s, last=2462705.468s, pio_stall_count=1)
(TxPktCnt=5960586432,RxPktCnt=5963056955) PIO Send Bufs port 3 with 8 bufs
from 32 to 39. PIO avail regs:  <0>=(1455444101454155)
<1>=(4100140514101400)  <2>=(45100000)  <3>=(0)  <4>=(0)  <5>=(0)  <6>=(0)
<7>=(0) . PIO shadow regs:  <1>=(405541050145ebff)  (err=23)
tr037:39.PIO Send Stall after at least 2.10M failed send attempts
(elapsed=278.602s, last=4999304.123s, pio_stall_count=1)
(TxPktCnt=61756904051,RxPktCnt=61772810688) PIO Send Bufs port 1 with 8 bufs
from 0 to 7. PIO avail regs:  <0>=(504400541150401)  <1>=(5044450510455544)
<2>=(14155155)  <3>=(0)  <4>=(0)  <5>=(0)  <6>=(0)  <7>=(0) . PIO shadow
regs:  <0>=(500415154014fbfe)  (err=23)

