[gmx-users] mdrun CVS version crashes instantly when run across nodes in parallel

Erik Brandt erikb at theophys.kth.se
Tue Jan 22 13:36:15 CET 2008


Hello Gromacs users.

With the CVS version, mdrun crashes instantly when run in parallel
across nodes (for any simulation system). The cluster consists of 8
nodes with Intel 6600 Quad-Core processors. As long as a job runs on a
single node (using 1, 2 or 4 CPUs) everything works fine, but when
trying to run across several nodes mdrun crashes immediately with the
following error message (no output or log files are written to disk):

> Getting Loaded...
> Reading file topol.tpr, VERSION 3.3.99_development_20071104 (single precision)
> Loaded with Money
>
> [warhol8:29695] *** An error occurred in MPI_Allreduce
> [warhol8:29695] *** on communicator MPI_COMM_WORLD
> [warhol8:29695] *** MPI_ERR_COMM: invalid communicator
> [warhol8:29695] *** MPI_ERRORS_ARE_FATAL (goodbye)

For the 1024 DPPC benchmark system, the following two commands were
used to start the simulation (default names for the input files):

> /opt/gromacs/cvs/bin/grompp
> /opt/openmp/1.2.4/bin/mpirun --hostfile hostfile /opt/gromacs/cvs/bin/mdrun_mpi -v -dd 2 2 2

where hostfile contains two specific nodes with 4 slots each.
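For reference, an OpenMPI hostfile with that layout could look like the
sketch below (the node names are only examples; warhol8 appears in the
error output above, warhol7 is hypothetical):

    # two nodes, 4 MPI slots each
    warhol7 slots=4
    warhol8 slots=4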

The OS is Ubuntu 7.10 x86_64 on all nodes. mdrun_mpi is compiled with
OpenMPI 1.2.4, but I have also tried LAM/MPI 7.1.2 and it crashes in
the same manner with an identical error message. Furthermore, I have
tried a static compilation on another cluster (Intel Xeon EM64T
processors) and copied the binaries to our cluster, with the same
result. I have searched the web for this error, and there are some
suggestions that it may be related to the 64-bit architecture, see e.g.

http://www.open-mpi.org/community/lists/users/2006/04/0978.php

The MPI installation on the cluster works for Gromacs version 3.3.2
and also for simple MPI test programs, such as one where each node
writes out its name and rank.
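
For reference, a minimal version of such a test looks like the sketch
below, extended with an MPI_Allreduce over MPI_COMM_WORLD since that is
the call that fails inside mdrun (this is only a diagnostic sketch, not
part of Gromacs):

    /* Minimal MPI check: print each rank's hostname, then perform an
     * MPI_Allreduce over MPI_COMM_WORLD (the call that fails in mdrun). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, namelen;
        char name[MPI_MAX_PROCESSOR_NAME];
        int local = 1, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &namelen);
        printf("rank %d of %d running on %s\n", rank, size, name);

        /* Sum one integer per rank; the result should equal the rank count. */
        MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("Allreduce sum = %d (expected %d)\n", total, size);

        MPI_Finalize();
        return 0;
    }

It compiles with mpicc and can be launched with the same mpirun/hostfile
command as above.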

Does anyone have any ideas on the origins of these crashes and/or
suggestions on how to resolve them?

Regards
Erik Brandt

Ph.D. Student
Theoretical Physics, KTH, Stockholm, Sweden

-- 
Erik Brandt <erikb at theophys.kth.se>
KTH
