[gmx-developers] Re: gmx_fatal deadlock bug

Szilárd Páll szilard.pall at cbr.su.se
Fri Jan 29 12:36:37 CET 2010


Hi,

> The fix looks fine; the only weird thing I see is the 'if (msg==NULL)' check in _gmx_error.

That is indeed weird; the reason for that check is to avoid another
deadlock. As both print_warn_num() and _gmx_error() use warn_buf,
which is a global resource, I made them both use the same mutex.
However, this yields a deadlock again, as the latter is called from
gmx_fatal(), which is called by the former (see below).

#0  __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:130
#1  0x00007f55cb1f4190 in _L_lock_102 () from /lib/libpthread.so.0
#2  0x00007f55cb1f3a7e in __pthread_mutex_lock (mutex=0x8610e0) at
pthread_mutex_lock.c:86
#3  0x000000000053b803 in _gmx_error (key=0x60c9b6 "fatal", msg=0x80
<Address 0x80 out of bounds>,
    file=0x7f55cb1fbfc0..., line=-1) at
/home/pszilard/Projects/gmx/gmx-master/src/gmxlib/gmx_fatal.c:781
#4  0x000000000053c03c in gmx_fatal (f_errno=0,
    file=0x60cc70
"/home/pszilard/Projects/gmx/gmx-master/src/gmxlib/gmx_fatal.c",
line=593,
    fmt=0x210a010
"/home/pszilard/Work/Projects/gmx/gmx-master/build_cmake/src/kernel/grompp")
    at /home/pszilard/Projects/gmx/gmx-master/src/gmxlib/gmx_fatal.c:455
#5  0x000000000053c7d7 in print_warn_num (bFatalError=1)
    at /home/pszilard/Projects/gmx/gmx-master/src/gmxlib/gmx_fatal.c:593
#6  0x000000000041aad6 in main (argc=1, argv=<value optimized out>)
    at /home/pszilard/Projects/gmx/gmx-master/src/kernel/grompp.c:1277

As gmx_fatal always passes a non-empty msg, the test handles this case.
However, there might be other places where _gmx_error() is called
with a non-empty msg but without holding the lock on warning_mutex...
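
To make the pattern concrete, here is a minimal sketch of it. This is
not the GROMACS code: demo_buf, demo_mutex, demo_error() and
demo_print_warn() are hypothetical stand-ins for warn_buf,
warning_mutex, _gmx_error() and print_warn_num(). It also assumes an
error-checking pthread mutex, so that relocking from the same thread
returns EDEADLK instead of hanging as in the backtrace above.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for warn_buf and warning_mutex. */
static char            demo_buf[256];
static pthread_mutex_t demo_mutex;

/* Assumption: an error-checking mutex, so a relock by the owning
 * thread reports EDEADLK rather than blocking forever. */
static int demo_init(void)
{
    pthread_mutexattr_t attr;
    int                 rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
    {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
    {
        rc = pthread_mutex_init(&demo_mutex, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Stand-in for _gmx_error(): takes the lock only when it has to read
 * the shared buffer, i.e. when the caller passed no message. */
static int demo_error(const char *msg, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    if (msg == NULL)
    {
        int rc = pthread_mutex_lock(&demo_mutex);
        if (rc != 0)
        {
            return rc; /* EDEADLK if this thread already holds it */
        }
        snprintf(out, outlen, "%s", demo_buf);
        pthread_mutex_unlock(&demo_mutex);
    }
    else
    {
        snprintf(out, outlen, "%s", msg);
    }
    return 0;
}

/* Stand-in for print_warn_num(): holds the mutex while calling down. */
static int demo_print_warn(int pass_msg, char *out, size_t outlen)
{
    char local[256];
    int  rc = pthread_mutex_lock(&demo_mutex);

    if (rc != 0)
    {
        return rc;
    }
    snprintf(demo_buf, sizeof demo_buf, "too many warnings");
    snprintf(local, sizeof local, "%s", demo_buf);
    rc = demo_error(pass_msg ? local : NULL, out, outlen);
    pthread_mutex_unlock(&demo_mutex);
    return rc;
}
```

After demo_init(), calling demo_print_warn(0, ...) makes the callee
try to take the mutex its caller already holds, while
demo_print_warn(1, ...) goes through the unlocked branch. The weak
spot is the same as in the real code: nothing stops some other caller
from passing a message that points into the shared buffer without
holding the lock.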


> I haven't seen gmx_fatal deadlock yet: what triggered it?

Essentially, print_warn_num() and gmx_fatal(), which it calls, were both
using the same mutex, although they access different resources.

Here's the backtrace:

#0  __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:130
#1  0x00007f56d896e190 in _L_lock_102 () from /lib/libpthread.so.0
#2  0x00007f56d896da7e in __pthread_mutex_lock (mutex=0x8610e0) at
pthread_mutex_lock.c:86
#3  0x000000000053beb5 in gmx_fatal (f_errno=0, file=0x80 <Address
0x80 out of bounds>, line=591,
    fmt=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>)
    at /home/pszilard/Projects/gmx/gmx-master/src/gmxlib/gmx_fatal.c:268
#4  0x000000000053c7d7 in print_warn_num (bFatalError=1)
    at /home/pszilard/Projects/gmx/gmx-master/src/gmxlib/gmx_fatal.c:591
#5  0x000000000041aad6 in main (argc=1, argv=<value optimized out>)
    at /home/pszilard/Projects/gmx/gmx-master/src/kernel/grompp.c:1277

> In general, gmx_fatal.c and futil.c contain many ugly hacks that need to go away. Especially futil.c with its dependence on a global list of open files/pipes, and its interlocking function calls, is a constant source of deadlocks or thread safety issues whenever someone wants to change something. The only real way to fix this is to change the interface to the rest of the code.
> The sheer amount of work involved in changing APIs that are called by most of the code in Gromacs has kept me from doing it now, however. Perhaps it's best to wait for the 5.0 branch.

I can't comment much, except that it would be really good, if global
variables are really necessary, to have one mutex per global resource
and maybe even one pair of files (get/set style) that actually have
access to these (and therefore need to lock).

I do realize, though, that if we start to move toward C++ there
are better design schemes for this.
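
As a rough sketch of the one-mutex-per-resource idea with get/set
style access, something like the following could work. All names
here (demo_warn_text, demo_warn_count and their accessors) are made up
for illustration and do not exist in GROMACS.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical global resources. They are file-scope static, so only
 * the accessors in this file can touch them, and each has its own
 * mutex. */
static char            demo_warn_text[256];
static pthread_mutex_t demo_warn_text_mutex = PTHREAD_MUTEX_INITIALIZER;

static int             demo_warn_count;
static pthread_mutex_t demo_warn_count_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Setter: copies the caller's string into the shared buffer. */
void demo_set_warn_text(const char *text)
{
    assert(text != NULL);
    pthread_mutex_lock(&demo_warn_text_mutex);
    snprintf(demo_warn_text, sizeof demo_warn_text, "%s", text);
    pthread_mutex_unlock(&demo_warn_text_mutex);
}

/* Getter: copies the shared buffer out, so the caller never holds a
 * pointer into the protected data after the lock is released. */
void demo_get_warn_text(char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    pthread_mutex_lock(&demo_warn_text_mutex);
    snprintf(out, outlen, "%s", demo_warn_text);
    pthread_mutex_unlock(&demo_warn_text_mutex);
}

/* Increments the warning counter and returns the new value. */
int demo_add_warning(void)
{
    int n;

    pthread_mutex_lock(&demo_warn_count_mutex);
    n = ++demo_warn_count;
    pthread_mutex_unlock(&demo_warn_count_mutex);
    return n;
}

/* Returns the current value of the warning counter. */
int demo_get_warn_count(void)
{
    int n;

    pthread_mutex_lock(&demo_warn_count_mutex);
    n = demo_warn_count;
    pthread_mutex_unlock(&demo_warn_count_mutex);
    return n;
}
```

The point of this layout is that no accessor calls another locking
function while it holds a mutex, so functions that happen to call each
other cannot end up relocking the same mutex.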

--
Szilárd

> On Jan 28, 2010, at 20:05 , Szilárd Páll wrote:
>
>> Hi,
>>
>> I have recently committed a bugfix for gmx_fatal.c that fixes a
>> deadlock we traced and fixed with Berk this afternoon. Basically the
> >> debug_mutex (which, to be honest, I don't know exactly what it is) was
> >> used to lock more than one resource in different functions that
> >> happened to call each other.
>>
>> The reason I am writing is that there might still be some situations
>> in which problems might occur and it seems that gmx_fatal would need a
> >> bit of checking and rewriting. I am not so familiar with the code, so I
> >> thought I'd let you know about the issue; I also left a couple of
> >> comments where I was not sure what to do.
>>
>> Best regards,
>> --
>> Szilárd
>
>


