[gmx-developers] MPI_ERR_COMM on 4.5.5-patches
Alexander Schlaich
alexander.schlaich at fu-berlin.de
Wed Aug 29 12:17:34 CEST 2012
Hi Berk,
your patch didn't fix the problem.
Following the program execution with a debugger, I found the MPI error
is thrown at src/mdlib/fft5d.c, line 196
fft5d_plan fft5d_plan_3d(int NG, int MG, int KG, MPI_Comm comm[2], int flags, t_complex** rlin, t_complex** rlout)
192 /* comm, prank and P are in the order of the decomposition
(plan->cart is in the order of transposes) */
193 #ifdef GMX_MPI
194 if (GMX_PARALLEL_ENV_INITIALIZED && comm[0] != MPI_COMM_NULL)
195 {
196 -> MPI_Comm_size(comm[0],&P[0]);
197 MPI_Comm_rank(comm[0],&prank[0]);
198 }
199 else
It looks to me as if the symbol MPI_COMM_NULL is not initialized (at
least it is not zero, as I expected). Adding a #define MPI_COMM_NULL 0
(see the original discussion #931) makes all the problems go away, though
I know this is not a proper fix.
I think MPI_COMM_NULL should come in via #include <mpi.h>, with its
type and value handled by the MPI implementation, so I don't have an
obvious fix...
Thanks for your help,
Alex
Am Dienstag, den 28.08.2012, 22:41 +0200 schrieb Berk Hess:
> Hi,
>
> I think I might have found it already.
> Could you try the fix below and report back if this solved the problem?
>
> Cheers,
>
> Berk
>
>
> index 735c0e8..e00fa6f 100644
> --- a/src/mdlib/pme.c
> +++ b/src/mdlib/pme.c
> @@ -1814,8 +1814,11 @@ static void init_atomcomm(gmx_pme_t
> pme,pme_atomcomm_t *atc, t_commrec *cr,
> if (pme->nnodes > 1)
> {
> atc->mpi_comm = pme->mpi_comm_d[dimind];
> - MPI_Comm_size(atc->mpi_comm,&atc->nslab);
> - MPI_Comm_rank(atc->mpi_comm,&atc->nodeid);
> + if (atc->mpi_comm != MPI_COMM_NULL)
> + {
> + MPI_Comm_size(atc->mpi_comm,&atc->nslab);
> + MPI_Comm_rank(atc->mpi_comm,&atc->nodeid);
> + }
> }
> if (debug)
> {
>
>
>
> On 08/28/2012 10:34 PM, Berk Hess wrote:
> > Hi,
> >
> > This seems to be a bug in Gromacs.
> > As this is not in a Gromacs release yet, we could resolve this without
> > a bug report.
> >
> > Are you skilled enough to run this in a debugger and tell me
> > which MPI_Comm_size
> > call in Gromacs is causing this?
> >
> > Cheers,
> >
> > Berk
> >
> > On 08/28/2012 07:39 PM, Alexander Schlaich wrote:
> >> Dear Gromacs team,
> >>
> >> I just tried to install the release-4.5.5_patches branch with
> >> --enable-mpi on our cluster (OpenMPI-1.4.2), resulting in an error
> >> when calling mdrun with pme enabled:
> >>
> >> Reading file topol.tpr, VERSION 4.5.5-dev-20120810-2859895 (single
> >> precision)
> >> [sheldon:22663] *** An error occurred in MPI_comm_size
> >> [sheldon:22663] *** on communicator MPI_COMM_WORLD
> >> [sheldon:22663] *** MPI_ERR_COMM: invalid communicator
> >> [sheldon:22663] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >>
> >> This seems to be related to a recent post on the list; however, I
> >> could not find a solution there:
> >> http://lists.gromacs.org/pipermail/gmx-users/2012-July/073316.html
> >> The 4.5.5 release version, on the other hand, works fine.
> >>
> >> Taking a closer look I found commit
> >> dcf8b67e2801f994dae56374382b9e330833de30, "changed PME MPI_Comm
> >> comparisions to MPI_COMM_NULL, fixes #931" (Berk Hess). Apparently
> >> here the communicators were changed such that the initialization
> >> fails on my system. Reverting this single commit on the head of the
> >> release-4.5.5 branch solved the issue for me.
> >>
> >> As I am no MPI expert I would like to know if my MPI implementation
> >> is misbehaving here, if I made a configuration mistake or if I should
> >> file a bug report?
> >>
> >> Thanks for your help,
> >>
> >> Alex