[gmx-developers] MPI stall?

Michael Shirts michael.shirts at virginia.edu
Sun Dec 6 16:03:25 CET 2009


Hi, all-

I'm getting a weird MPI stall with the git master repository version.
I compiled with debugging on and in double precision, and am running on
an 8-processor MacPro.
After running for 10 min or so parallelized 8 ways, it appears to
stall.  I attached a debugger to the threads to see where it's stuck.
The backtrace on the head node was (most arguments removed for clarity):

#0  0x907fb29a in write$NOCANCEL$UNIX2003 ()
#1  0x907fb1f2 in _swrite ()
#2  0x907fb11f in __sflush ()
#3  0x907ffcfc in __swbuf ()
#4  0x90838e92 in fputc ()
#5  0x000c2dfd in print_time (out=0xa00c7690, runtime=0xbfffd5e0, step=44600, ir=0x1017e00, cr=0x9004e0) at sim_util.c:164
#6  0x00019215 in do_md  at md.c:2316
#7  0x00013138 in mdrunner  at md.c:216
#9  0x0001b9cc in main (argc=14, argv=0xbffff3a0) at mdrun.c:518

And for the other nodes:

#0  0x907c536a in swtch_pri ()
#1  0x90832e65 in sched_yield ()
#2  0x00a05515 in mca_pml_ob1_send ()
#3  0x00710445 in MPI_Sendrecv ()
#4  0x00048fe4 in dd_sendrecv_rvec (dd=0x91dc00, ddimind=0, direction=1, buf_s=0x1034c00, n_s=333, buf_r=0xd22f38, n_r=360) at domdec_network.c:115
#5  0x00029c32 in dd_move_x (dd=0x91dc00, box=0x9260fc, x=0xd21000) at domdec.c:657
#6  0x000c3f77 in do_force  at sim_util.c:521
#7  0x00017478 in do_md  at md.c:1794
#8  0x00013138 in mdrunner at md.c:687
#9  0x00011cbb in mdrunner_threads  at md.c:216
#10 0x0001b9cc in main (argc=14, argv=0x9184e0) at mdrun.c:518

Has anyone else observed this?  Has it been seen on other MacPros, or
only with debugging on?
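
In case anyone wants to collect comparable traces on their own machine,
here is a minimal sketch (Python, not part of GROMACS) of how backtraces
like the ones above could be gathered in one pass. It assumes gdb is
installed and permitted to attach to the running mdrun processes. The
helper names are made up for illustration, and the PIDs have to be looked
up separately (e.g. with ps).

```python
import subprocess
import sys


def gdb_backtrace_command(pid):
    """Build the argv for a non-interactive gdb run against one process.

    -p attaches to the given PID, -batch exits after the commands finish,
    and -ex runs a single gdb command: here, a backtrace of every thread.
    """
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def collect_backtraces(pids, run=subprocess.run):
    """Attach to each PID in turn and return {pid: backtrace text}.

    The `run` parameter exists only so the function can be exercised
    without gdb; by default it is subprocess.run.
    """
    traces = {}
    for pid in pids:
        result = run(gdb_backtrace_command(pid), capture_output=True, text=True)
        traces[pid] = result.stdout
    return traces


if __name__ == "__main__":
    # Hypothetical usage: python collect_bt.py <pid> <pid> ...
    pids = [int(arg) for arg in sys.argv[1:]]
    for pid, trace in collect_backtraces(pids).items():
        print(f"=== pid {pid} ===")
        print(trace)
```

Comparing the innermost frames per process shows quickly whether one
rank is sitting somewhere different from the rest, as the head node is
in the traces above.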

Best,
~~~~~~~~~~~~
Michael Shirts
Assistant Professor
Department of Chemical Engineering
University of Virginia
michael.shirts at virginia.edu
(434)-243-1821


