[gmx-developers] Debugging MPI problems?

Marc Baaden baaden at smplinux.de
Mon Oct 1 12:12:32 CEST 2007


Hi,

We have implemented some interactivity in Gromacs, using the IMD
protocol. Our code works fine with the latest CVS version on a
single processor (tested on MacOSX, Linux and IBM AIX).
With more than one processor it still works fine on IBM AIX, but on
MacOSX and Linux one of the MPI processes crashes (bus error on
MacOSX, signal 11 on Linux; see below).

Any hints on how to find out what is going on?

(We have only modified 3 files: sim_util.c, md.c and mdrun.c,
 adding about 5 lines in total. Our modifications are serial
 code; no parallelism is needed.)
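
One thing that may be worth checking in a situation like this is whether
serial-only code (socket setup, log output) is also reached on non-master
ranks, where its file handles may never have been set up. The following is
only a minimal sketch of such a guard, not our actual patch: the names
imd_is_master, imd_log and IMD_MASTER_RANK are hypothetical, and the rank is
assumed to come from a standard call such as MPI_Comm_rank().

```c
#include <stdio.h>

/* Hypothetical names throughout: nothing here is taken from Gromacs
 * or from our IMD patch. In an MPI build the rank would typically be
 * obtained once with MPI_Comm_rank(MPI_COMM_WORLD, &rank). */
#define IMD_MASTER_RANK 0

/* Returns 1 if the serial-only IMD code should run on this rank. */
static int imd_is_master(int rank)
{
    return rank == IMD_MASTER_RANK;
}

/* Write a log line only on the master rank and only through a valid
 * stream. Returns the number of characters written, or 0 if the call
 * was skipped. */
static int imd_log(FILE *fp, int rank, const char *msg)
{
    if (!imd_is_master(rank) || fp == NULL || msg == NULL) {
        return 0; /* non-master rank or unusable stream: do nothing */
    }
    return fprintf(fp, "IIMD > %s\n", msg);
}
```

With a guard of this kind, a call such as imd_log(stderr, rank, "Awaiting
connection") does nothing on rank 1 instead of passing a possibly
uninitialised stream to fprintf.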

Thanks in advance for any hints and tips,
  Marc Baaden


ERROR ON MACOSX
===============

[..]
starting mdrun 'toto'
5000000 steps, 200000.0 ps.
IIMD > ---- Entering in iimd_init
IIMD > Interactive MD bind to port 3000 
IIMD > ---- Entering in iimd_treateven
IIMD > ---- Entering in iimd_probeconnection
IIMD > Awaiting connection
[lux:07872] *** Process received signal ***
[lux:07872] Signal: Bus error (10)
[lux:07872] Associated errno: Unknown error: 1684890368 (1684890368)
[lux:07872] Signal code:  (1701734764)
[lux:07872] Failing at address: 0x454d4954
[ 1] [0xbfffcb48, 0x0000000c] (-P-)
[ 2] (vfprintf_l + 0x5e) [0xbfffcb78, 0x900e3b50] 
[ 3] (fprintf + 0x49) [0xbfffcba8, 0x90010c49] 
[ 4] (gimd_ext_forces + 0x6a4) [0xbfffcc28, 0x0008677c] 
[ 5] (do_force + 0xd69) [0xbfffcd68, 0x00036540] 
[ 6] (do_md + 0x1dcd) [0xbfffd218, 0x00017b54] 
[ 7] (mdrunner + 0xdf2) [0xbfffd368, 0x000158a2] 
[ 8] (main + 0x598) [0xbfffd418, 0x0001a416] 
[ 9] (_start + 0xd8) [0xbfffd458, 0x00001f22] 
[10] (start + 0x29) [0xbfffd470, 0x00001e49] 
[11] [0x00000000, 0x00000013] (FP-)
[lux:07872] *** End of error message ***
om-mpirun noticed that job rank 1 with PID 7872 on node lux.lbt.ibpc.fr exited on signal 10 (Bus error). 
1 process killed (possibly by Open MPI)


ERROR ON LINUX
==============

[..]
starting mdrun 'toto'
5000000 steps, 200000.0 ps.
IIMD > ---- Entering in iimd_init
IIMD > Interactive MD bind to port 3000 
IIMD > ---- Entering in iimd_treateven
IIMD > ---- Entering in iimd_probeconnection
IIMD > Awaiting connection
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 12401 failed on node n0 (127.0.0.1) due to signal 11.
-----------------------------------------------------------------------------
make: *** [run] Error 11

-- 
 Dr. Marc Baaden  - Institut de Biologie Physico-Chimique, Paris
 mailto:baaden at smplinux.de      -      http://www.baaden.ibpc.fr
 FAX: +33 15841 5026  -  Tel: +33 15841 5176  ou  +33 609 843217




