[Fwd: Re: [gmx-users] mdrun CVS version crashes instantly when run across nodes in parallel]

Carsten Kutzner ckutzne at gwdg.de
Tue Jan 22 20:15:03 CET 2008


Hi Erik,

I have made a test with today's CVS version and I also ran into the
problem you described. It happens as soon as one uses more than one node
and, at the same time, more than one process per node.

The problem seems to be in gmx_sumd, where in the two-step summation a
call to MPI_Allreduce is made with cr->nc.comm_inter, which happens to be
a NULL pointer - and it clearly should not be.

The inter-node communicator is freed in gmx_setup_nodecomm (network.c,
line 393) if an intra-node communicator is present - I do not understand
why the communicator is freed there.

Maybe Berk can help us with that? If I comment out the MPI_Comm_free
call, the code runs happily - I haven't checked the results, though.
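
To illustrate, the two-step summation boils down to something like the
following (a rough sketch only, not the actual gmx_sumd code; the
communicator names just mirror the cr->nc fields):

#include <mpi.h>

/* Hierarchical (two-step) sum of n doubles in buf, with work as a
 * temporary buffer:
 *   step 1: reduce within each node onto the node master,
 *   step 2: the node masters sum across nodes via comm_inter,
 *   step 3: broadcast the result within each node.
 * If comm_inter has already been freed, step 2 hands MPI_Allreduce an
 * invalid communicator, which would explain the MPI_ERR_COMM abort. */
static void two_step_sum(int n, double buf[], double work[],
                         MPI_Comm comm_intra, MPI_Comm comm_inter,
                         int rank_intra)
{
    MPI_Reduce(buf, work, n, MPI_DOUBLE, MPI_SUM, 0, comm_intra);
    if (rank_intra == 0)
    {
        MPI_Allreduce(work, buf, n, MPI_DOUBLE, MPI_SUM, comm_inter);
    }
    MPI_Bcast(buf, n, MPI_DOUBLE, 0, comm_intra);
}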

Carsten



Erik Brandt wrote:
> Hi,
> First, let me thank you for your help so far.
> I have actually run a very similar ring program written in C for
> different message sizes up to 1000000 reals and there are no problems
> whatsoever. For clarity, let me attach the code of this simple MPI
> test program:
> 
> #include <stdio.h>
> #include <mpi.h>
> 
> int main (int argc, char *argv[])
> {
>   int my_rank, size;
>   int sum;
> 
>   MPI_Init(&argc, &argv);
> 
>   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
> 
>   MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>   /* Sum the ranks of all processes over MPI_COMM_WORLD. */
> 
>   MPI_Allreduce (&my_rank, &sum, 1, MPI_INT,
>                   MPI_SUM, MPI_COMM_WORLD);
> 
>   printf ("PE%i:\tSum = %i\n", my_rank, sum);
> 
>   MPI_Finalize();
>
>   return 0;
> }
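>
> For reference, I compile and launch it roughly like this (the mpirun
> path is the one from my installation, and the source file name is just
> an example):
>
> mpicc allreduce_test.c -o allreduce_test
> /opt/openmp/1.2.4/bin/mpirun --hostfile hostfile ./allreduce_test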
> 
> I use the GNU compilers that are distributed with Ubuntu 7.10 for both C
> and Fortran
> gcc (GCC) 4.1.3 20070929 (prerelease)
> GNU Fortran (GCC) 4.2.1
> 
> Regards
> / Erik
> 
> On Tue, 2008-01-22 at 16:49 +0100, Carsten Kutzner wrote:
>> Erik,
>>
>> you could test the MPI_Allreduce in a C program - after all, the example
>> test program is in Fortran. At the same time I would check various
>> message sizes (e.g. 1, 10, 100, 1000 reals) - it could be that OpenMPI
>> uses different algorithms for the Allreduce depending on the size of the
>> message.
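>>
>> Something along these lines would do - just a quick sketch, with the
>> message sizes easy to adjust:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[])
>> {
>>     int sizes[] = { 1, 10, 100, 1000 };
>>     int rank, i, j;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>     for (i = 0; i < 4; i++)
>>     {
>>         int n = sizes[i];
>>         float *in  = malloc(n * sizeof(float));
>>         float *out = malloc(n * sizeof(float));
>>
>>         for (j = 0; j < n; j++)
>>             in[j] = 1.0f;
>>
>>         /* each process contributes 1.0 per element, so every element
>>            of out should equal the number of processes */
>>         MPI_Allreduce(in, out, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
>>
>>         if (rank == 0)
>>             printf("n = %4d: out[0] = %g\n", n, out[0]);
>>
>>         free(in);
>>         free(out);
>>     }
>>
>>     MPI_Finalize();
>>     return 0;
>> }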
>>
>> Carsten
>>
>>
>> Erik Brandt wrote:
>> > Hello.
>> > Yes, I have tried that little program and it works, no problems. Is
>> > there any other information you might need that I have forgotten to mention?
>> > 
>> > Regards
>> > / Erik
>> > 
>> > On Tue, 2008-01-22 at 14:20 +0100, Carsten Kutzner wrote:
>> >> Hi Erik,
>> >>
>> >> have you tried the small MPI test program to which your link points?
>> >> > http://www.open-mpi.org/community/lists/users/2006/04/0978.php
>> >>
>> >> This would help to figure out whether the problem is on the Gromacs or
>> >> on the MPI side. From the result that Gromacs 3.3.x works on your
>> >> cluster, unfortunately nothing can be concluded, since 3.3 does not use
>> >> the MPI_Allreduce call.
>> >>
>> >> Just a few weeks ago our computing center detected a bug in the 64-bit
>> >> version of MPI_Reduce in the MVAPICH/MVAPICH2 libraries. So there might
>> >> be a similar problem here ...
>> >>
>> >> Carsten
>> >>
>> >>
>> >> Erik Brandt wrote:
>> >> > Hello Gromacs users.
>> >> > 
>> >> > In the CVS version, mdrun crashes instantly when run in parallel
>> >> > across nodes (for any simulation system). The cluster consists of 8
>> >> > nodes with Intel 6600 Quad-Core processors. As long as a job is run
>> >> > on a single node (using 1, 2, or 4 CPUs) everything works fine, but
>> >> > when trying to run on several nodes mdrun crashes immediately with
>> >> > the following error message (no output or log files are written to
>> >> > disk):
>> >> > 
>> >> >> Getting Loaded...
>> >> >> Reading file topol.tpr, VERSION 3.3.99_development_20071104 (single
>> >> >> precision)
>> >> >> Loaded with Money
>> >> >>
>> >> >> [warhol8:29695] *** An error occurred in MPI_Allreduce
>> >> >> [warhol8:29695] *** on communicator MPI_COMM_WORLD
>> >> >> [warhol8:29695] *** MPI_ERR_COMM: invalid communicator
>> >> >> [warhol8:29695] *** MPI_ERRORS_ARE_FATAL (goodbye)
>> >> > 
>> >> > For the 1024 DPPC benchmark system the following two commands were
>> >> > used to start the simulation (default names for the input files):
>> >> > 
>> >> >> /opt/gromacs/cvs/bin/grompp
>> >> >> /opt/openmp/1.2.4/bin/mpirun --hostfile hostfile \
>> >> >>   /opt/gromacs/cvs/bin/mdrun_mpi -v -dd 2 2 2
>> >> > 
>> >> > where hostfile contains two specific nodes with 4 slots each (8 slots
>> >> > in total, matching the 2x2x2 domain decomposition requested with -dd).
>> >> > 
>> >> > The OS is Ubuntu 7.10 x86_64 on all nodes. mdrun_mpi is compiled with
>> >> > OpenMPI 1.2.4, but I have also tried LAM/MPI 7.1.2 and it crashes
>> >> > in the same manner with an identical error message. Furthermore I have
>> >> > tried a static compilation on another cluster (Intel Xeon EM64T
>> >> > processors) and copied the files to our cluster, with the same
>> >> > result. I have searched the web for this error and there are some
>> >> > suggestions that this may be related to the 64-bit architecture, see e.g.
>> >> > 
>> >> > http://www.open-mpi.org/community/lists/users/2006/04/0978.php
>> >> > 
>> >> > The MPI installation on the cluster works for the 3.3.2 version of
>> >> > Gromacs and also for some simple MPI test programs, such as nodes
>> >> > writing out their names and ranks.
>> >> > 
>> >> > Does anyone have any ideas on the origins of these crashes and/or
>> >> > suggestions on how to resolve them?
>> >> > 
>> >> > Regards
>> >> > Erik Brandt
>> >> > 
>> >> > Ph.D. Student
>> >> > Theoretical Physics, KTH, Stockholm, Sweden
>> >> > 
>> >> > -- 
>> >> > Erik Brandt <erikb at theophys.kth.se>
>> >> > KTH
>> >> > 
>> >> > 
>> >>
>> > -- 
>> > Erik Brandt <erikb at theophys.kth.se>
>> > KTH
>> > 
>>
> -- 
> Erik Brandt <erikb at theophys.kth.se>
> KTH
> 

-- 
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
http://www.mpibpc.mpg.de/research/dep/grubmueller/
http://www.gwdg.de/~ckutzne



