[gmx-developers] Debugging MPI problems?

David van der Spoel spoel at xray.bmc.uu.se
Mon Oct 1 16:24:03 CEST 2007


Axel Kohlmeyer wrote:
> On Mon, 1 Oct 2007, Marc Baaden wrote:
> 
> 
> marc,
> 
> what you describe looks a lot like you have a mismatch 
> in integer data types. on a big endian machine that is
> harmless, because you may only need to read the lower
> half of the data, but on the little endians, you may 
> violate access boundary restrictions.
> 
> you should try to enforce a core dump and have a closer
> look through the stack frames.
> 
> cheers,
>    axel.
> 
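
One way to follow Axel's suggestion on systems where core files are disabled by default is to raise the limit from inside the program. This is only a minimal sketch for POSIX systems: enable_core_dumps is a hypothetical helper, not an existing Gromacs function, and whether a core file is actually written still depends on the system configuration.

```c
#include <sys/resource.h>

/* Hypothetical helper, not part of Gromacs: raise the soft core-file
 * size limit to the hard limit so that a crashing rank can leave a
 * core file. Returns 0 on success, -1 on failure. */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
    {
        return -1;
    }
    rl.rlim_cur = rl.rlim_max; /* soft limit may not exceed the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Called once early in main on every rank, this lets the failing process dump core, and the core file can then be opened in a debugger to walk through the stack frames.
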
> MB> 
> MB> Hi,
> MB> 
> MB> we have implemented some interactivity in Gromacs, using the IMD 
> MB> protocol. Our code works absolutely fine with the latest CVS version
> MB> using a single processor (tested on MacOSX, Linux, IBM AIX).
> MB> When increasing the number of processors, it still works fine on
> MB> IBM AIX, but on MacOSX and Linux we get MPI errors (see below).
> MB> 
> MB> Any hints on how to try and find out what's going on ?
> MB> 
> MB> (We have only modified 3 files: sim_util.c, md.c and mdrun.c
> MB>  adding a total of ca. 5 lines..  our modifications are serial
> MB> code, no parallelism needed)
> MB> 
> MB> Thanks in advance for any hints and tips,
> MB>   Marc Baaden
> MB> 
> MB> 
> MB> ERROR ON MACOSX
> MB> ===============
> MB> 
> MB> [..]
> MB> starting mdrun 'toto'
> MB> 5000000 steps, 200000.0 ps.
> MB> IIMD > ---- Entering in iimd_init
> MB> IIMD > Interactive MD bind to port 3000 
> MB> IIMD > ---- Entering in iimd_treateven
> MB> IIMD > ---- Entering in iimd_probeconnection
> MB> IIMD > Awaiting connection
> MB> [lux:07872] *** Process received signal ***
> MB> [lux:07872] Signal: Bus error (10)
> MB> [lux:07872] Associated errno: Unknown error: 1684890368 (1684890368)
> MB> [lux:07872] Signal code:  (1701734764)
> MB> [lux:07872] Failing at address: 0x454d4954
> MB> [ 1] [0xbfffcb48, 0x0000000c] (-P-)
> MB> [ 2] (vfprintf_l + 0x5e) [0xbfffcb78, 0x900e3b50] 
> MB> [ 3] (fprintf + 0x49) [0xbfffcba8, 0x90010c49] 

Your problem is most likely in the fprintf call made from gimd_ext_forces
(frame 4 below). If you compile with -g, this stack dump should also include
a line number.
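
The crash is reported for job rank 1, and the failing address 0x454d4954 consists of four printable ASCII bytes (E, M, I, T), so one plausible cause is that fprintf receives a stream pointer that is not valid on that rank. As a sketch under that assumption, a guard like the following avoids printing to a stream that was never opened. log_if_open is a hypothetical name, not a Gromacs function, and the check only helps if unopened streams are initialised to NULL rather than left uninitialised.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper, not part of Gromacs: print to fp only when the
 * stream is actually open. Returns the number of characters written,
 * or 0 when fp is NULL. */
static int log_if_open(FILE *fp, const char *fmt, ...)
{
    va_list ap;
    int     n;

    if (fp == NULL)
    {
        return 0; /* e.g. a rank that never opened the log file */
    }
    va_start(ap, fmt);
    n = vfprintf(fp, fmt, ap);
    va_end(ap);
    return n;
}
```

If the crash disappears with such a guard in place, the next step is to check where the stream is opened and whether that code runs on every rank.
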


> MB> [ 4] (gimd_ext_forces + 0x6a4) [0xbfffcc28, 0x0008677c] 
> MB> [ 5] (do_force + 0xd69) [0xbfffcd68, 0x00036540] 
> MB> [ 6] (do_md + 0x1dcd) [0xbfffd218, 0x00017b54] 
> MB> [ 7] (mdrunner + 0xdf2) [0xbfffd368, 0x000158a2] 
> MB> [ 8] (main + 0x598) [0xbfffd418, 0x0001a416] 
> MB> [ 9] (_start + 0xd8) [0xbfffd458, 0x00001f22] 
> MB> [10] (start + 0x29) [0xbfffd470, 0x00001e49] 
> MB> [11] [0x00000000, 0x00000013] (FP-)
> MB> [lux:07872] *** End of error message ***
> MB> om-mpirun noticed that job rank 1 with PID 7872 on node lux.lbt.ibpc.fr exited on signal 10 (Bus error). 
> MB> 1 process killed (possibly by Open MPI)
> MB> 
> MB> 
> MB> ERROR ON LINUX
> MB> ==============
> MB> 
> MB> [..]
> MB> starting mdrun 'toto'
> MB> 5000000 steps, 200000.0 ps.
> MB> IIMD > ---- Entering in iimd_init
> MB> IIMD > Interactive MD bind to port 3000 
> MB> IIMD > ---- Entering in iimd_treateven
> MB> IIMD > ---- Entering in iimd_probeconnection
> MB> IIMD > Awaiting connection
> MB> -----------------------------------------------------------------------------
> MB> One of the processes started by mpirun has exited with a nonzero exit
> MB> code.  This typically indicates that the process finished in error.
> MB> If your process did not finish in error, be sure to include a "return
> MB> 0" or "exit(0)" in your C code before exiting the application.
> MB> 
> MB> PID 12401 failed on node n0 (127.0.0.1) due to signal 11.
> MB> -----------------------------------------------------------------------------
> MB> make: *** [run] Error 11
> MB> 
> MB> 
> 


-- 
David van der Spoel, Ph.D.
Molec. Biophys. group, Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone:	+46184714205. Fax: +4618511755.
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://folding.bmc.uu.se


