[gmx-developers] Debugging MPI problems?
Marc Baaden
baaden at smplinux.de
Mon Oct 1 12:12:32 CEST 2007
Hi,
we have implemented some interactivity in Gromacs, using the IMD
protocol. Our code works absolutely fine with the latest CVS version
using a single processor (tested on MacOSX, Linux, IBM AIX).
When increasing the number of processors, it still works fine on
IBM AIX, but on MacOSX and Linux we get MPI errors (see below).
Any hints on how to find out what is going on?
(We have only modified three files, sim_util.c, md.c and mdrun.c,
adding about 5 lines in total. Our modifications are serial
code; no parallelism is needed.)
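
One thing worth checking: do_force is called on every MPI rank, so code hooked in there also runs on the non-master ranks, even when it is meant to be serial. Below is a minimal sketch of a rank guard, assuming the IMD state is only set up on rank 0. The names imd_state_t, imd_should_run and imd_hook_guarded are hypothetical and belong neither to Gromacs nor to the code described above. Inside mdrun the rank test would presumably come from the commrec (for example the MASTER(cr) macro) rather than from a bare integer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for state that only rank 0 initialises. */
typedef struct {
    int   initialised;  /* set to 1 once the init routine has run */
    FILE *log;          /* stream used for the "IIMD >" messages  */
} imd_state_t;

/* Returns 1 when the serial-only hook may run on this rank, else 0. */
static int imd_should_run(int rank, const imd_state_t *st)
{
    assert(rank >= 0);
    return rank == 0 && st != NULL && st->initialised && st->log != NULL;
}

/* Guarded hook: returns 1 if the hook body ran, 0 if it was skipped. */
static int imd_hook_guarded(int rank, imd_state_t *st)
{
    if (!imd_should_run(rank, st)) {
        return 0;  /* non-master rank or state not set up: do nothing */
    }
    fprintf(st->log, "IIMD > hook running on rank %d\n", rank);
    return 1;
}
```

With such a guard, a call made on rank 1 returns without touching the stream or any other state that exists only on rank 0.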
Thanks in advance for any hints and tips,
Marc Baaden
ERROR ON MACOSX
===============
[..]
starting mdrun 'toto'
5000000 steps, 200000.0 ps.
IIMD > ---- Entering in iimd_init
IIMD > Interactive MD bind to port 3000
IIMD > ---- Entering in iimd_treateven
IIMD > ---- Entering in iimd_probeconnection
IIMD > Awaiting connection
[lux:07872] *** Process received signal ***
[lux:07872] Signal: Bus error (10)
[lux:07872] Associated errno: Unknown error: 1684890368 (1684890368)
[lux:07872] Signal code: (1701734764)
[lux:07872] Failing at address: 0x454d4954
[ 1] [0xbfffcb48, 0x0000000c] (-P-)
[ 2] (vfprintf_l + 0x5e) [0xbfffcb78, 0x900e3b50]
[ 3] (fprintf + 0x49) [0xbfffcba8, 0x90010c49]
[ 4] (gimd_ext_forces + 0x6a4) [0xbfffcc28, 0x0008677c]
[ 5] (do_force + 0xd69) [0xbfffcd68, 0x00036540]
[ 6] (do_md + 0x1dcd) [0xbfffd218, 0x00017b54]
[ 7] (mdrunner + 0xdf2) [0xbfffd368, 0x000158a2]
[ 8] (main + 0x598) [0xbfffd418, 0x0001a416]
[ 9] (_start + 0xd8) [0xbfffd458, 0x00001f22]
[10] (start + 0x29) [0xbfffd470, 0x00001e49]
[11] [0x00000000, 0x00000013] (FP-)
[lux:07872] *** End of error message ***
om-mpirun noticed that job rank 1 with PID 7872 on node lux.lbt.ibpc.fr exited on signal 10 (Bus error).
1 process killed (possibly by Open MPI)
ERROR ON LINUX
==============
[..]
starting mdrun 'toto'
5000000 steps, 200000.0 ps.
IIMD > ---- Entering in iimd_init
IIMD > Interactive MD bind to port 3000
IIMD > ---- Entering in iimd_treateven
IIMD > ---- Entering in iimd_probeconnection
IIMD > Awaiting connection
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 12401 failed on node n0 (127.0.0.1) due to signal 11.
-----------------------------------------------------------------------------
make: *** [run] Error 11
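
Since the Linux run reports only a signal 11 and no backtrace, one common way to get more information is to make each process wait at startup until a debugger has been attached. A minimal sketch, assuming a POSIX system; the function name wait_for_debugger and the environment variable passed to it are hypothetical and not part of Gromacs or of any MPI library.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * If the environment variable env_name is set, print this process's PID
 * and spin until a debugger changes `attached` to a non-zero value.
 * Returns 0 immediately when the variable is not set, 1 after release.
 */
static int wait_for_debugger(const char *env_name)
{
    volatile int attached = 0;  /* volatile so the loop re-reads it */

    assert(env_name != NULL);
    if (getenv(env_name) == NULL) {
        return 0;  /* normal run: do not block */
    }
    fprintf(stderr, "PID %ld waiting for debugger attach\n", (long)getpid());
    fflush(stderr);
    while (!attached) {
        sleep(5);  /* attach now, then set `attached` from the debugger */
    }
    return 1;
}
```

Called early in main, for example as wait_for_debugger("MDRUN_DEBUG_WAIT") (a made-up variable name), this lets one attach gdb to the PID printed by the failing rank, select the wait_for_debugger frame, issue "set var attached = 1" and continue until the crash. Building with -g makes the resulting backtrace far more useful.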
--
Dr. Marc Baaden - Institut de Biologie Physico-Chimique, Paris
mailto:baaden at smplinux.de - http://www.baaden.ibpc.fr
FAX: +33 15841 5026 - Tel: +33 15841 5176 ou +33 609 843217