[gmx-developers] Debugging MPI problems?

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Mon Oct 1 12:30:55 CEST 2007


On Mon, 1 Oct 2007, Marc Baaden wrote:


marc,

what you describe looks a lot like a mismatch in integer
data types. on a big endian machine that is harmless,
because you may only need to read the lower half of the
data, but on little endian machines you may violate
access boundary restrictions.

you should try to enforce a core dump and have a closer
look through the stack frames.

cheers,
   axel.

MB> 
MB> Hi,
MB> 
MB> we have implemented some interactivity in Gromacs, using the IMD 
MB> protocol. Our code works absolutely fine with the latest CVS version
MB> using a single processor (tested on MacOSX, Linux, IBM AIX).
MB> When increasing the number of processors, it still works fine on
MB> IBM AIX, but on MacOSX and Linux we get MPI errors (see below).
MB> 
MB> Any hints on how to try and find out what's going on ?
MB> 
MB> (We have only modified 3 files: sim_util.c, md.c and mdrun.c
MB>  adding a total of ca. 5 lines..  our modifications are serial
MB> code, no parallelism needed)
MB> 
MB> Thanks in advance for any hints and tips,
MB>   Marc Baaden
MB> 
MB> 
MB> ERROR ON MACOSX
MB> ===============
MB> 
MB> [..]
MB> starting mdrun 'toto'
MB> 5000000 steps, 200000.0 ps.
MB> IIMD > ---- Entering in iimd_init
MB> IIMD > Interactive MD bind to port 3000 
MB> IIMD > ---- Entering in iimd_treateven
MB> IIMD > ---- Entering in iimd_probeconnection
MB> IIMD > Awaiting connection
MB> [lux:07872] *** Process received signal ***
MB> [lux:07872] Signal: Bus error (10)
MB> [lux:07872] Associated errno: Unknown error: 1684890368 (1684890368)
MB> [lux:07872] Signal code:  (1701734764)
MB> [lux:07872] Failing at address: 0x454d4954
MB> [ 1] [0xbfffcb48, 0x0000000c] (-P-)
MB> [ 2] (vfprintf_l + 0x5e) [0xbfffcb78, 0x900e3b50] 
MB> [ 3] (fprintf + 0x49) [0xbfffcba8, 0x90010c49] 
MB> [ 4] (gimd_ext_forces + 0x6a4) [0xbfffcc28, 0x0008677c] 
MB> [ 5] (do_force + 0xd69) [0xbfffcd68, 0x00036540] 
MB> [ 6] (do_md + 0x1dcd) [0xbfffd218, 0x00017b54] 
MB> [ 7] (mdrunner + 0xdf2) [0xbfffd368, 0x000158a2] 
MB> [ 8] (main + 0x598) [0xbfffd418, 0x0001a416] 
MB> [ 9] (_start + 0xd8) [0xbfffd458, 0x00001f22] 
MB> [10] (start + 0x29) [0xbfffd470, 0x00001e49] 
MB> [11] [0x00000000, 0x00000013] (FP-)
MB> [lux:07872] *** End of error message ***
MB> om-mpirun noticed that job rank 1 with PID 7872 on node lux.lbt.ibpc.fr exited on signal 10 (Bus error). 
MB> 1 process killed (possibly by Open MPI)
MB> 
MB> 
MB> ERROR ON LINUX
MB> ==============
MB> 
MB> [..]
MB> starting mdrun 'toto'
MB> 5000000 steps, 200000.0 ps.
MB> IIMD > ---- Entering in iimd_init
MB> IIMD > Interactive MD bind to port 3000 
MB> IIMD > ---- Entering in iimd_treateven
MB> IIMD > ---- Entering in iimd_probeconnection
MB> IIMD > Awaiting connection
MB> -----------------------------------------------------------------------------
MB> One of the processes started by mpirun has exited with a nonzero exit
MB> code.  This typically indicates that the process finished in error.
MB> If your process did not finish in error, be sure to include a "return
MB> 0" or "exit(0)" in your C code before exiting the application.
MB> 
MB> PID 12401 failed on node n0 (127.0.0.1) due to signal 11.
MB> -----------------------------------------------------------------------------
MB> make: *** [run] Error 11
MB> 
MB> 

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.



