[gmx-developers] MPI_ERR_COMM on 4.5.5-patches

Alexander Schlaich alexander.schlaich at fu-berlin.de
Wed Aug 29 13:57:49 CEST 2012


Great, this solves my problems!
Should your first patch also be applied? It is just an additional check, and for nnodes>1 the communicator should be valid in my opinion…
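
In case it helps anyone else who hits this: below is a minimal sketch of the combined pattern the two changes give (illustrative only; the struct and function names are made up and this is not the actual GROMACS code). Unused handles are set to MPI_COMM_NULL at initialization, and size/rank queries are guarded against MPI_COMM_NULL.

#include <mpi.h>

/* Sketch only; names are invented and simplified from the real code. */
typedef struct {
    MPI_Comm comm_d[2];   /* decomposition communicators; unused if nnodes == 1 */
    int      nslab;
    int      nodeid;
} pme_sketch_t;

static void pme_sketch_init(pme_sketch_t *p, int nnodes,
                            MPI_Comm comm_major, MPI_Comm comm_minor)
{
    if (nnodes == 1)
    {
        /* Second patch: mark unused handles explicitly; a plain NULL/0
         * is not recognized as "no communicator" by the MPI library. */
        p->comm_d[0] = MPI_COMM_NULL;
        p->comm_d[1] = MPI_COMM_NULL;
    }
    else
    {
        p->comm_d[0] = comm_major;
        p->comm_d[1] = comm_minor;
    }

    /* First patch: only query handles that are actually valid. */
    if (p->comm_d[0] != MPI_COMM_NULL)
    {
        MPI_Comm_size(p->comm_d[0], &p->nslab);
        MPI_Comm_rank(p->comm_d[0], &p->nodeid);
    }
    else
    {
        p->nslab  = 1;
        p->nodeid = 0;
    }
}

int main(int argc, char **argv)
{
    pme_sketch_t pme;

    MPI_Init(&argc, &argv);
    /* Single-node case: no decomposition communicators are created. */
    pme_sketch_init(&pme, 1, MPI_COMM_NULL, MPI_COMM_NULL);
    MPI_Finalize();
    return 0;
}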

Alex

On 29.08.2012, at 12:49, Berk Hess wrote:

> Hi,
> 
> I found the problem.
> When running PME on one node, a communicator was NULL instead of MPI_COMM_NULL.
> Find the fix below.
> 
> Thanks for reporting this and helping with the debugging.
> 
> Berk
> 
> index 735c0e8..2f5ce0a 100644
> --- a/src/mdlib/pme.c
> +++ b/src/mdlib/pme.c
> @@ -2076,6 +2076,10 @@ int gmx_pme_init(gmx_pme_t * pmedata,
> 
>     if (pme->nnodes == 1)
>     {
> +#ifdef GMX_MPI
> +        pme->mpi_comm_d[0] = MPI_COMM_NULL;
> +        pme->mpi_comm_d[1] = MPI_COMM_NULL;
> +#endif
>         pme->ndecompdim = 0;
>         pme->nodeid_major = 0;
>         pme->nodeid_minor = 0;
> 
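> To make the failure mode concrete, here is a minimal reproducer (a sketch
> only, not the actual fft5d.c code): checks like the one in fft5d_plan_3d()
> only recognize MPI_COMM_NULL, so a zeroed handle slips through and is
> handed to MPI_Comm_size(), which aborts with MPI_ERR_COMM.
> 
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
>     MPI_Comm comm0 = 0;   /* what the unset field effectively contained */
>     int      P;
> 
>     MPI_Init(&argc, &argv);
> 
>     if (comm0 != MPI_COMM_NULL)    /* true: 0 is not MPI_COMM_NULL */
>     {
>         MPI_Comm_size(comm0, &P);  /* MPI_ERR_COMM: invalid communicator */
>     }
> 
>     MPI_Finalize();
>     return 0;
> }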
> 
> On 08/29/2012 12:46 PM, Alexander Schlaich wrote:
>> Just an addition:
>> I just realized that only running the MPI version on a single core seems to be affected. So this would correspond to the "invalid communicator" error in my previous mail.
>> 
>> On 29.08.2012, at 12:17, Alexander Schlaich wrote:
>> 
>>> Hi Berk,
>>> 
>>> your patch didn't fix the problem.
>>> Following the program execution in a debugger, I found that the MPI error
>>> is thrown at src/mdlib/fft5d.c, line 196:
>>> 
>>> fft5d_plan fft5d_plan_3d(int NG, int MG, int KG, MPI_Comm comm[2], int
>>> flags, t_complex** rlin, t_complex** rlout)
>>> 
>>> 192     /* comm, prank and P are in the order of the decomposition
>>> (plan->cart is in the order of transposes) */
>>> 193 #ifdef GMX_MPI
>>> 194     if (GMX_PARALLEL_ENV_INITIALIZED && comm[0] != MPI_COMM_NULL)
>>> 195     {
>>> 196  ->     MPI_Comm_size(comm[0],&P[0]);
>>> 197         MPI_Comm_rank(comm[0],&prank[0]);
>>> 198     }
>>> 199     else
>>> 
>>> It seems to me that the symbol MPI_COMM_NULL is not initialized (at
>>> least it is not zero, as I expected). Adding a #define MPI_COMM_NULL 0
>>> (see the original discussion, #931) solves all the problems, though I
>>> know this is not the proper solution.
>>> I think MPI_COMM_NULL should come from #include <mpi.h>, with its
>>> type and value handled by the MPI implementation, so I don't have an
>>> obvious fix...
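>>>
>>> For what it is worth, a quick standalone check (illustrative only; I can
>>> only speak for the OpenMPI build here) suggests the predefined handle
>>> from <mpi.h> is simply not zero, which would also explain why redefining
>>> it to 0 merely hides the problem rather than fixing it:
>>>
>>> #include <stdio.h>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     MPI_Comm zeroed = (MPI_Comm)0;   /* a zero-initialized handle */
>>>
>>>     MPI_Init(&argc, &argv);
>>>     /* Prints "no" here: MPI_COMM_NULL is a distinct predefined handle
>>>      * whose representation is implementation-defined. */
>>>     printf("zeroed handle equals MPI_COMM_NULL: %s\n",
>>>            (zeroed == MPI_COMM_NULL) ? "yes" : "no");
>>>     MPI_Finalize();
>>>     return 0;
>>> }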
>>> 
>>> 
>>> Thanks for your help,
>>> 
>>> Alex
>>> 
>>> 
>>> 
>>> On Tuesday, 28.08.2012, at 22:41 +0200, Berk Hess wrote:
>>>> Hi,
>>>> 
>>>> I think I might have found it already.
>>>> Could you try the fix below and report back whether it solves the problem?
>>>> 
>>>> Cheers,
>>>> 
>>>> Berk
>>>> 
>>>> 
>>>> index 735c0e8..e00fa6f 100644
>>>> --- a/src/mdlib/pme.c
>>>> +++ b/src/mdlib/pme.c
>>>> @@ -1814,8 +1814,11 @@ static void init_atomcomm(gmx_pme_t
>>>> pme,pme_atomcomm_t *atc, t_commrec *cr,
>>>>      if (pme->nnodes > 1)
>>>>      {
>>>>          atc->mpi_comm = pme->mpi_comm_d[dimind];
>>>> -        MPI_Comm_size(atc->mpi_comm,&atc->nslab);
>>>> -        MPI_Comm_rank(atc->mpi_comm,&atc->nodeid);
>>>> +        if (atc->mpi_comm != MPI_COMM_NULL)
>>>> +        {
>>>> +            MPI_Comm_size(atc->mpi_comm,&atc->nslab);
>>>> +            MPI_Comm_rank(atc->mpi_comm,&atc->nodeid);
>>>> +        }
>>>>      }
>>>>      if (debug)
>>>>      {
>>>> 
>>>> 
>>>> 
>>>> On 08/28/2012 10:34 PM, Berk Hess wrote:
>>>>> Hi,
>>>>> 
>>>>> This seems to be a bug in Gromacs.
>>>>> As this is not in a Gromacs release yet, we could resolve this without
>>>>> a bug report.
>>>>> 
>>>>> Are you skilled enough to run this in a debugger and tell me which
>>>>> MPI_Comm_size call in Gromacs is causing this?
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Berk
>>>>> 
>>>>> On 08/28/2012 07:39 PM, Alexander Schlaich wrote:
>>>>>> Dear Gromacs team,
>>>>>> 
>>>>>> I just tried to install the release-4.5.5_patches branch with
>>>>>> --enable-mpi on our cluster (OpenMPI 1.4.2), resulting in an error
>>>>>> when calling mdrun with PME enabled:
>>>>>> 
>>>>>> Reading file topol.tpr, VERSION 4.5.5-dev-20120810-2859895 (single
>>>>>> precision)
>>>>>> [sheldon:22663] *** An error occurred in MPI_comm_size
>>>>>> [sheldon:22663] *** on communicator MPI_COMM_WORLD
>>>>>> [sheldon:22663] *** MPI_ERR_COMM: invalid communicator
>>>>>> [sheldon:22663] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>>>>> 
>>>>>> This seems to be related to a recent post on the list, but I could
>>>>>> not find a solution there:
>>>>>> http://lists.gromacs.org/pipermail/gmx-users/2012-July/073316.html
>>>>>> The 4.5.5 release version, however, works fine.
>>>>>> 
>>>>>> Taking a closer look, I found commit
>>>>>> dcf8b67e2801f994dae56374382b9e330833de30, "changed PME MPI_Comm
>>>>>> comparisions to MPI_COMM_NULL, fixes #931" (Berk Hess). Apparently
>>>>>> this changed the communicator handling in a way that makes the
>>>>>> initialization fail on my system. Reverting this single commit on the
>>>>>> head of the release-4.5.5 branch solved the issue for me.
>>>>>> 
>>>>>> As I am no MPI expert, I would like to know whether my MPI
>>>>>> implementation is misbehaving here, whether I made a configuration
>>>>>> mistake, or whether I should file a bug report.
>>>>>> 
>>>>>> Thanks for your help,
>>>>>> 
>>>>>> Alex
>>> 
> 



