[gmx-developers] MPI_ERR_COMM on 4.5.5-patches
Berk Hess
hess at kth.se
Wed Aug 29 12:49:06 CEST 2012
Hi,
I found the problem.
When running PME on one node, a communicator was NULL instead of
MPI_COMM_NULL.
Find the fix below.
Thanks for reporting this and helping with the debugging.
Berk
index 735c0e8..2f5ce0a 100644
--- a/src/mdlib/pme.c
+++ b/src/mdlib/pme.c
@@ -2076,6 +2076,10 @@ int gmx_pme_init(gmx_pme_t * pmedata,
     if (pme->nnodes == 1)
     {
+#ifdef GMX_MPI
+        pme->mpi_comm_d[0] = MPI_COMM_NULL;
+        pme->mpi_comm_d[1] = MPI_COMM_NULL;
+#endif
         pme->ndecompdim   = 0;
         pme->nodeid_major = 0;
         pme->nodeid_minor = 0;
On 08/29/2012 12:46 PM, Alexander Schlaich wrote:
> Just an addition:
> I just realized that only the case of running the MPI version on a single core seems affected, so this would correspond to the invalid communicator in my previous mail.
>
> Am 29.08.2012 um 12:17 schrieb Alexander Schlaich:
>
>> Hi Berk,
>>
>> your patch didn't fix the problem.
>> Following the program execution with a debugger, I found the MPI error
>> is thrown at src/mdlib/fft5d.c, line 196
>>
>> fft5d_plan fft5d_plan_3d(int NG, int MG, int KG, MPI_Comm comm[2], int
>> flags, t_complex** rlin, t_complex** rlout)
>>
>> 192 /* comm, prank and P are in the order of the decomposition
>> (plan->cart is in the order of transposes) */
>> 193 #ifdef GMX_MPI
>> 194 if (GMX_PARALLEL_ENV_INITIALIZED && comm[0] != MPI_COMM_NULL)
>> 195 {
>> 196 -> MPI_Comm_size(comm[0],&P[0]);
>> 197 MPI_Comm_rank(comm[0],&prank[0]);
>> 198 }
>> 199 else
>>
>> It seems to me that the symbol MPI_COMM_NULL is not initialized (at
>> least it is not zero, as I expected). Adding a #define MPI_COMM_NULL 0
>> (see the original discussion, #931) makes all the problems go away, though I
>> know this is not the proper solution.
>> I think MPI_COMM_NULL should be provided by #include <mpi.h>, with its
>> type and value handled by the MPI implementation, so I don't have an
>> obvious fix...
>>
>>
>> Thanks for your help,
>>
>> Alex
>>
>>
>>
>> Am Dienstag, den 28.08.2012, 22:41 +0200 schrieb Berk Hess:
>>> Hi,
>>>
>>> I think I might have found it already.
>>> Could you try the fix below and report back if this solved the problem?
>>>
>>> Cheers,
>>>
>>> Berk
>>>
>>>
>>> index 735c0e8..e00fa6f 100644
>>> --- a/src/mdlib/pme.c
>>> +++ b/src/mdlib/pme.c
>>> @@ -1814,8 +1814,11 @@ static void init_atomcomm(gmx_pme_t *pme, pme_atomcomm_t *atc, t_commrec *cr,
>>> if (pme->nnodes > 1)
>>> {
>>> atc->mpi_comm = pme->mpi_comm_d[dimind];
>>> - MPI_Comm_size(atc->mpi_comm,&atc->nslab);
>>> - MPI_Comm_rank(atc->mpi_comm,&atc->nodeid);
>>> + if (atc->mpi_comm != MPI_COMM_NULL)
>>> + {
>>> + MPI_Comm_size(atc->mpi_comm,&atc->nslab);
>>> + MPI_Comm_rank(atc->mpi_comm,&atc->nodeid);
>>> + }
>>> }
>>> if (debug)
>>> {
>>>
>>>
>>>
>>> On 08/28/2012 10:34 PM, Berk Hess wrote:
>>>> Hi,
>>>>
>>>> This seems to be a bug in Gromacs.
>>>> As this is not in a Gromacs release yet, we could resolve this without
>>>> a bug report.
>>>>
>>>> Are you skilled enough to run this in a debugger and tell me
>>>> which MPI_Comm_size
>>>> call in Gromacs is causing this?
>>>>
>>>> Cheers,
>>>>
>>>> Berk
>>>>
>>>> On 08/28/2012 07:39 PM, Alexander Schlaich wrote:
>>>>> Dear Gromacs team,
>>>>>
>>>>> I just tried to install the release-4.5.5_patches branch with
>>>>> --enable-mpi on our cluster (OpenMPI-1.4.2), resulting in an error
>>>>> when calling mdrun with PME enabled:
>>>>>
>>>>> Reading file topol.tpr, VERSION 4.5.5-dev-20120810-2859895 (single
>>>>> precision)
>>>>> [sheldon:22663] *** An error occurred in MPI_comm_size
>>>>> [sheldon:22663] *** on communicator MPI_COMM_WORLD
>>>>> [sheldon:22663] *** MPI_ERR_COMM: invalid communicator
>>>>> [sheldon:22663] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>>>>
>>>>> This seems to be related to a recent post on the list, however I
>>>>> could not find a solution:
>>>>> http://lists.gromacs.org/pipermail/gmx-users/2012-July/073316.html
>>>>> However, the 4.5.5 release version works fine.
>>>>>
>>>>> Taking a closer look I found commit
>>>>> dcf8b67e2801f994dae56374382b9e330833de30, "changed PME MPI_Comm
>>>>> comparisions to MPI_COMM_NULL, fixes #931" (Berk Hess). Apparently
>>>>> here the communicators were changed such that the initialization
>>>>> fails on my system. Reverting this single commit on the head of the
>>>>> release-4.5.5 branch solved the issue for me.
>>>>>
>>>>> As I am no MPI expert I would like to know if my MPI implementation
>>>>> is misbehaving here, if I made a configuration mistake or if I should
>>>>> file a bug report?
>>>>>
>>>>> Thanks for your help,
>>>>>
>>>>> Alex
>>
>> --
>> gmx-developers mailing list
>> gmx-developers at gromacs.org
>> http://lists.gromacs.org/mailman/listinfo/gmx-developers
>> Please don't post (un)subscribe requests to the list. Use the
>> www interface or send it to gmx-developers-request at gromacs.org.
More information about the gromacs.org_gmx-developers
mailing list