[gmx-developers] next Gromacs release
Berk Hess
hess at cbr.su.se
Mon Jun 14 16:17:28 CEST 2010
David van der Spoel wrote:
> On 2010-06-14 15.13, Carsten Kutzner wrote:
>> On Jun 12, 2010, at 12:14 PM, Carsten Kutzner wrote:
>>
>>> Hi,
>>>
>>> I have noticed that with some MPI implementations (Intel MPI, IBM's poe,
>>> and most likely also MPICH2) the g_tune_pme tool sometimes gets stuck
>>> after having successfully completed part of the test runs.
>>>
>>> This happens in cases where mdrun (in init_domain_decomposition) cannot
>>> find a suitable decomposition and shortly afterwards calls MPI_Abort via
>>> gmx_fatal. Some (buggy!?) MPI implementations cannot guarantee that all
>>> MPI processes are cleanly terminated after a call to MPI_Abort, and in
>>> those cases control is never returned to g_tune_pme, which then thinks
>>> mdrun is still running.
>>>
>>> My question is: do we really have to call gmx_fatal when no suitable DD
>>> grid can be found? At that point all MPI processes are still alive, and
>>> we could finish mdrun cleanly with an MPI_Finalize (just as in
>>> successful runs), thus avoiding the hangs in the tuning utility. I think
>>> that normal mdrun runs which cannot find a DD grid would, with those MPI
>>> implementations, also end up as zombies until they are eventually killed
>>> by the queueing system.
>> Any comments?
>>
>> Should I check in a patch?
>>
>> Carsten
>>
>
> This is of course a special case, but one that can be expected to happen.
> It would be much nicer to fix gmx_fatal, but how?
>
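For illustration, a minimal sketch (plain MPI, not the actual GROMACS code)
of the clean shutdown Carsten proposes above: when the error is detected
consistently on all ranks, every rank calls MPI_Finalize instead of the
master calling MPI_Abort. The function can_make_dd_grid() is a hypothetical
placeholder for the real decomposition setup.

#include <stdio.h>
#include <mpi.h>

/* Hypothetical check; stands in for the real decomposition setup. */
static int can_make_dd_grid(int nranks)
{
    return 0; /* pretend no suitable decomposition exists */
}

int main(int argc, char *argv[])
{
    int rank, nranks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    if (!can_make_dd_grid(nranks))
    {
        if (rank == 0)
        {
            fprintf(stderr, "Fatal error: no suitable domain decomposition\n");
        }
        /* All ranks reach this point, so we can shut down collectively
         * and return control cleanly to a caller such as g_tune_pme. */
        MPI_Finalize();
        return 1;
    }

    /* ... normal run ... */
    MPI_Finalize();
    return 0;
}
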
We replaced all relevant gmx_fatal calls in init_domain_decomposition with
gmx_fatal_collective.
The only issue is that this still calls gmx_fatal on the master, which
calls MPI_Abort.
I have now made an error handler that does not call gmx_abort, so that
MPI_Finalize can be called later.
But I just realized that there is still a problem when mdrun -multi is
used: since not all processes will then call gmx_fatal_collective, mdrun
will hang.
I guess we need to pass the commrec to gmx_fatal_collective and call
MPI_Abort as usual when we use multisim.
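
To make the multisim caveat concrete, here is a minimal sketch under a
simplified, hypothetical interface (the real gmx_fatal_collective takes
different arguments and would get this information from the commrec): when
every rank in MPI_COMM_WORLD will reach the call, a clean MPI_Finalize
works; when only the ranks of one simulation reach it, as with mdrun -multi,
we still have to fall back to MPI_Abort.

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Hypothetical helper, not the real gmx_fatal_collective. */
static void fatal_collective_sketch(int all_ranks_call_this, /* from commrec/multisim info */
                                    int is_master,
                                    const char *fmt, ...)
{
    if (is_master)
    {
        va_list ap;

        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
        fprintf(stderr, "\n");
    }

    if (all_ranks_call_this)
    {
        /* Every rank reaches this call, so we can shut down collectively. */
        MPI_Finalize();
        exit(1);
    }
    else
    {
        /* Only part of MPI_COMM_WORLD reaches this call (e.g. one
         * simulation of a multisim run); aborting is the only way not to
         * leave the other simulations hanging. */
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}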
Berk