[gmx-users] Re: problems with intel I7 (2.67 GHz)

Christof Koehler christof.koehler at bccms.uni-bremen.de
Fri Feb 12 16:04:43 CET 2010



Hello everybody.

I would like to chime in here, although my problem might not be directly
related.



> The problem of Gromacs stalling on i7 when using multiple CPUs is a MPI
> problem. It is most likely caused by a shared memory bug in Open MPI
> that was fixed in the latest release (1.4.1).
>
> Switching to openmpi-1.4.1 solves the problem.


We are using openmpi-1.4.1 on Nehalem CPUs. With the current gromacs
4.0.7 I see reproducible segfaults when either

"numactl --cpunodebind=0 --membind=0 mpirun ..."
or
"mpirun --mca mpi_paffinity_alone"

is used, e.g.

/usr/local/x86_64.Linux/bin/mpirun -np 4 --mca mpi_paffinity_alone 1
/usr/local/stow/gromacs407/x86_64.Linux/bin/mdrun_407_mpi_d
[neuro36a:01728] *** Process received signal ***
[neuro36a:01728] Signal: Segmentation fault (11)
[neuro36a:01728] Signal code: Address not mapped (1)
[neuro36a:01728] Failing at address: 0x8
[neuro36a:01728] [ 0] [0x7ff120]
[neuro36a:01728] [ 1] [0x7f15a7]
[neuro36a:01728] [ 2] [0x7cadeb]
[neuro36a:01728] [ 3] [0x7cacd5]
[neuro36a:01728] [ 4] [0x6cc533]
[neuro36a:01728] [ 5] [0x6d704e]
[neuro36a:01728] [ 6] [0x4a1f9e]
[neuro36a:01728] [ 7] [0x49c6cc]
[neuro36a:01728] [ 8] [0x40e046]
[neuro36a:01728] [ 9] [0x800749]
[neuro36a:01728] [10] [0x4001b9]
[neuro36a:01728] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 1728 on node neuro36a exited
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

mpirun -V
mpirun (Open MPI) 1.4.1


Everything works as expected if no core binding is used at all. The
serial version, built the same way but without the --enable-mpi switch,
shows no problems when used with numactl.
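For reference, the two builds differ only in that switch; roughly (a
sketch, the exact configure options and install prefixes on our machines
differ):

./configure --enable-double                # serial double-precision build
./configure --enable-double --enable-mpi   # MPI build, otherwise identical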

The numactl/mpirun combination, although a bit unusual, works fine with
other codes (e.g. cpmd, vasp, ...), as does the usual
"--mca mpi_paffinity_alone" switch to mpirun.

Since we use CPU binding to partition an eight-core node into two SGE
slots of four cores each, this situation is not optimal for us.
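To illustrate the intended partitioning (an illustration only; the real
invocations go through SGE and use our full install paths), the two slots
would be bound roughly like

numactl --cpunodebind=0 --membind=0 mpirun -np 4 mdrun_407_mpi_d ...
numactl --cpunodebind=1 --membind=1 mpirun -np 4 mdrun_407_mpi_d ...

i.e. one four-core job per NUMA node.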

I will try Open MPI 1.4.2 as soon as it is released, though.



Best Regards

Christof Köhler





--
Dr. rer. nat. Christof Köhler       email: c.koehler at bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-2486
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-4764
28359 Bremen

PGP:
http://www.bccms.uni-bremen.de/fileadmin/BCCMS/pgp_keys/ChristofKoehler_UniBremen.asc


