[gmx-developers] next Gromacs release

Berk Hess hess at cbr.su.se
Mon Jun 14 18:22:59 CEST 2010


Berk Hess wrote:
> David van der Spoel wrote:
>   
>> On 2010-06-14 15.13, Carsten Kutzner wrote:
>>     
>>> On Jun 12, 2010, at 12:14 PM, Carsten Kutzner wrote:
>>>
>>>       
>>>> Hi,
>>>>
>>>> I have noticed that with some MPI implementations (Intel MPI, IBM's
>>>> poe, and most likely also MPICH2) the g_tune_pme tool sometimes gets
>>>> stuck after having successfully completed part of the test runs.
>>>>
>>>> This happens in cases where mdrun (in init_domain_decomposition)
>>>> cannot find a suitable decomposition and shortly afterwards calls
>>>> MPI_Abort via gmx_fatal. Some (buggy!?) MPI implementations cannot
>>>> guarantee that all MPI processes are cleanly cancelled after a call
>>>> to MPI_Abort, and in those cases control is never returned to
>>>> g_tune_pme, which then thinks mdrun is still running.
>>>>
>>>> My question is: do we really have to call gmx_fatal when no suitable
>>>> DD grid can be found? At that point all MPI processes are still
>>>> alive, so we could finish mdrun cleanly with an MPI_Finalize (just as
>>>> in successful runs), thus avoiding the hangs in the tuning utility.
>>>> I think that normal mdruns too, when unable to find a DD grid, would
>>>> on those MPIs end up as zombies until eventually killed by the
>>>> queueing system.
>>>>         
>>> Any comments?
>>>
>>> Should I check in a patch?
>>>
>>> Carsten
>>>
>>>       
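To illustrate the clean exit Carsten proposes, here is a minimal
standalone sketch (toy C/MPI code, not our sources; the decomposition
check is a made-up stand-in). Because every rank evaluates the same
condition, the failure is detected collectively, and all ranks can
report it and call MPI_Finalize instead of MPI_Abort, so a wrapper
such as g_tune_pme sees a normal process exit:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nnodes;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nnodes);

        /* Stand-in for the decomposition setup: the check depends only
         * on nnodes, so every rank reaches the same conclusion. */
        int dd_ok = (nnodes % 2 == 0);

        if (!dd_ok)
        {
            if (rank == 0)
            {
                fprintf(stderr, "Fatal error: no suitable domain "
                        "decomposition for %d ranks\n", nnodes);
            }
            MPI_Finalize();          /* clean, collective shutdown */
            return EXIT_FAILURE;     /* caller still sees a failure */
        }

        /* ... a normal run would continue here ... */
        MPI_Finalize();
        return EXIT_SUCCESS;
    }
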
>> This is of course a special case that can be expected to happen. It
>> would be much nicer to fix gmx_fatal, but how?
>>
>>     
> We replaced all relevant gmx_fatal calls in init_domain_decomposition
> by gmx_fatal_collective.
> The only issue is that this still calls gmx_fatal on the master, which
> calls MPI_Abort.
> I have now made an error handler that does not call gmx_abort and
> instead calls MPI_Finalize later.
> But I just realized that there is still a problem when mdrun -multi is
> used: since not all processes will then call gmx_fatal_collective,
> mdrun will hang.
> I guess we need to pass the commrec to gmx_fatal_collective and call
> MPI_Abort as usual when we use multisim.
>
> Berk
>
>   
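A minimal sketch of the logic I have in mind (the names here are
hypothetical, not our actual API): via the commrec, the collective
fatal routine is told whether every rank of the job is participating;
if so it can finish cleanly, otherwise (mdrun -multi) it must keep the
hard abort, because the ranks of the other simulations would never
reach a collective MPI_Finalize.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* bWholeJob: true when all ranks of MPI_COMM_WORLD are known to
     * reach this call (i.e. not an mdrun -multi sub-simulation). */
    static void fatal_collective(int bWholeJob, int rank, const char *msg)
    {
        if (rank == 0)
        {
            fprintf(stderr, "Fatal error: %s\n", msg);
        }
        if (bWholeJob)
        {
            MPI_Finalize();          /* clean, collective shutdown */
            exit(EXIT_FAILURE);      /* non-zero so callers see failure */
        }
        /* Only a subset of the ranks is here; the rest would block
         * forever in MPI_Finalize, so the hard abort stays. */
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Toy failure: pretend DD setup failed on every rank. */
        fatal_collective(1, rank, "no suitable domain decomposition");

        return 0;                    /* not reached */
    }
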
I have just about fixed this now.
But I'll hold off on committing until Sander has MPI_Comm_compare
implemented in the thread-MPI library.
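For reference, MPI_Comm_compare is what lets the error handler decide,
portably, whether the communicator it was handed covers the whole job
(so a collective MPI_Finalize is safe) or only a subset (so it must
fall back to MPI_Abort). A toy standalone example of the call:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, result;
        MPI_Comm sub;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Make a sub-communicator holding only ranks of the same
         * parity, roughly analogous to one simulation of an
         * mdrun -multi job. */
        MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &sub);

        MPI_Comm_compare(MPI_COMM_WORLD, sub, &result);
        if (rank == 0)
        {
            printf("sub spans the whole job: %s\n",
                   (result == MPI_IDENT || result == MPI_CONGRUENT) ?
                   "yes" : "no");
        }

        MPI_Comm_free(&sub);
        MPI_Finalize();
        return 0;
    }
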

Berk
