To: krb5-bugs@mit.edu
Date: Thu, 9 Jan 2020 11:40:30 -0500
Subject: krb5int_key_delete: Assertion `destructors_set[keynum] == 1' failed.
From: "Spencer Malone" <malone.spencer@gmail.com>
Heyo! In our apache/httpd instances, we're regularly seeing the following cause segfaults:

httpd: threads.c:395: krb5int_key_delete: Assertion `destructors_set[keynum] == 1' failed.
We don't directly use krb5, but the library is pulled in by a few transitive dependencies.
After digging around on the net, it seems we aren't alone: https://stackoverflow.com/questions/54213685/apache-seg-fault-krb5int-key-delete-assertion-destructors-setkeynum-1-fail

I tried to dump all apache modules that could be loading libkrb5 and came up with...
/etc/httpd/modules/libphp7.so
    libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007f4f35f65000)
    libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007f4f34839000)
/etc/httpd/modules/libphp7-zts.so
    libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007ff228114000)
    libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007ff226c04000)
/etc/httpd/modules/mod_ssl.so
    libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007f0c0c7fa000)
    libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007f0c0bf9d000)
Any thoughts on a potential solution or cause?

On Thu, Jan 9, 2020 at 8:01 PM Greg Hudson via RT <rt@krbdev.mit.edu> wrote:
We do see rare reports of assertion failures in the krb5int_key functions, which handle an internal table of thread-specific data keys.

Ticket 8614 (which you linked to from the stackoverflow answer) happens because krb5int_key_register() is called on a key that is marked as already registered. A candidate explanation there is two different versions of libgssapi_krb5 in the same process, both calling into the same libkrb5support. I'm not sure that's right, though: while it's easy enough to have multiple versions of libgssapi_krb5 installed on a machine, they should all have the same soname (since we haven't changed it in a long time), which I think would make it difficult to load more than one version into a process.

The failure reported here is the inverse: krb5int_key_delete() is called on a key that isn't marked as registered.  krb5int_key_delete() is invoked from the library finalizer of libgssapi_krb5 (also the finalizer of the krb5 version of libcom_err, but typically the e2fsprogs version of com_err is used on Linux).  Although it's possible for the finalizer to run without the initializer having run, there is a check for that.  So I don't have any good theories.
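
For readers following along, the two failure modes described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the libkrb5support implementation: the table size, the `model_` names, and the status codes are all hypothetical, and the model returns a status where the real functions would trip an assertion.

```c
#include <assert.h>

/* Hypothetical, simplified model of an internal table of thread-specific
 * data keys. This is NOT the libkrb5support code; the size and all names
 * below are made up for illustration. */
enum { MODEL_KEY_COUNT = 4 };

typedef enum {
    MODEL_OK = 0,
    MODEL_ALREADY_REGISTERED,   /* the ticket 8614 pattern */
    MODEL_NOT_REGISTERED        /* the pattern reported in this ticket */
} model_status;

/* 1 if the key is currently marked as registered, 0 otherwise. */
static unsigned char destructors_set[MODEL_KEY_COUNT];

/* Mark a key as registered. Registering it twice is reported as an error
 * here; the real function asserts instead. */
model_status model_key_register(int keynum)
{
    assert(keynum >= 0 && keynum < MODEL_KEY_COUNT);
    if (destructors_set[keynum] == 1)
        return MODEL_ALREADY_REGISTERED;
    destructors_set[keynum] = 1;
    return MODEL_OK;
}

/* Unmark a key. Deleting a key that is not marked as registered is
 * reported as an error here; the real function asserts instead. */
model_status model_key_delete(int keynum)
{
    assert(keynum >= 0 && keynum < MODEL_KEY_COUNT);
    if (destructors_set[keynum] != 1)
        return MODEL_NOT_REGISTERED;
    destructors_set[keynum] = 0;
    return MODEL_OK;
}
```

In this model, calling `model_key_delete()` on a key that was never passed to `model_key_register()` returns `MODEL_NOT_REGISTERED`, which corresponds to the `destructors_set[keynum] == 1` assertion in the report.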
Subject: Re: [krbdev.mit.edu #8863] krb5int_key_delete: Assertion `destructors_set[keynum] == 1' failed.
From: "Spencer Malone" <malone.spencer@gmail.com>
Date: Thu, 9 Jan 2020 23:13:31 -0500
To: rt@krbdev.mit.edu
Ah, bummer! I was hoping for something easy. One last bit of context that I realized was missing from my initial report: on the Apache side, we see this most often (all of the time? Difficult to say with certainty) while telling Apache to do a "graceful reload or restart", which involves a managed ramp-down and killing of child processes, and it is those children that are throwing this.

Throwing a hypothesis out there without much context on this project: if I'm reading your message correctly, is https://github.com/krb5/krb5/blob/81e47875e3de0e52fbb11d61ef30a9406497af73/src/lib/gssapi/krb5/gssapi_krb5.c#L1117-L1119 potentially a good place to look? If so, I'm wondering whether the init function's registration of those variables (https://github.com/krb5/krb5/blob/81e47875e3de0e52fbb11d61ef30a9406497af73/src/lib/gssapi/krb5/gssapi_krb5.c#L1072-L1081) could either be interrupted (in my head I'm imagining by a SIGKILL) or have thrown an error, leaving some of the keys unregistered.

I'm very rarely hanging out in C-land, so the idea of interrupted execution may be completely off base, but the error handling still stands out to me as having potential.

On Thu, Jan 9, 2020 at 8:01 PM Greg Hudson via RT <rt@krbdev.mit.edu> wrote:

On Fri, Jan 10, 2020, 2:26 AM Greg Hudson via RT <rt@krbdev.mit.edu> wrote:
There are some theoretically possible execution paths along those lines that could lead to the assertion failure, but they don't seem likely to be the cause of the assertion failures you're seeing.

I don't think the interruption scenario is plausible.  SIGKILL would abort the process immediately (no library unloading), as would any unhandled signal.  If a signal arrived, was handled, and the handler returned, the initializer would continue running and would finish registering all of the keys.  Signal handlers are not allowed to call exit().  Signal handlers are allowed to call _exit(), but that would bypass library unloading.  On top of that, the "graceful reload or restart" operation would not seem likely to send a signal to processes while they are in the middle of initializing the GSSAPI library.
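
The distinction drawn above between exit(), _exit(), and SIGKILL can be checked with a small POSIX experiment. This is a hedged sketch: an atexit() handler stands in for a library finalizer (both run during normal exit() processing), and the function and enum names are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* How the child process should terminate (hypothetical names). */
typedef enum { END_WITH_EXIT, END_WITH_UNDERSCORE_EXIT, END_WITH_SIGKILL } end_mode;

static int report_fd = -1;

/* Stand-in for a library finalizer: reports that exit-time cleanup ran. */
static void fake_finalizer(void)
{
    ssize_t ignored = write(report_fd, "F", 1);
    (void)ignored;
}

/* Fork a child that terminates in the requested way. Returns 1 if the
 * child's exit-time handler ran, 0 if it did not, -1 on setup failure. */
int handler_ran_in_child(end_mode mode)
{
    int fds[2];
    char byte;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: register the handler, then terminate as requested. */
        close(fds[0]);
        report_fd = fds[1];
        if (atexit(fake_finalizer) != 0)
            _exit(2);
        if (mode == END_WITH_EXIT)
            exit(0);            /* runs atexit handlers and finalizers */
        if (mode == END_WITH_SIGKILL)
            raise(SIGKILL);     /* cannot be caught; no cleanup runs */
        _exit(0);               /* immediate termination; no cleanup runs */
    }
    /* Parent: one byte means the handler ran; EOF means it did not. */
    close(fds[1]);
    n = read(fds[0], &byte, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1;
}
```

Only the `END_WITH_EXIT` case should report that the handler ran, which is why a killed or `_exit()`-ing child would never reach the finalizer that calls krb5int_key_delete().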

The failed initializer scenario actually kind of tracks, because of a bug in gssint_mechglue_init() ( https://github.com/krb5/krb5/blob/master/src/lib/gssapi/mechglue/g_initialize.c#L106 ) where error values can be ignored.  (If errors were correctly handled, the INITIALIZER_RAN() guard in gssint_mechglue_fini() would not consider the initializer to have run because it returned an error.)  But I don't think this is a likely scenario for two reasons.  First, you say you're not using krb5, and gssint_mechglue_init() doesn't actually run until a GSSAPI function is invoked.  (It's possible that the PHP module invokes a GSSAPI function when it starts up, I guess.)  Second, failures inside gss_krb5int_lib_init() should be vanishingly uncommon; one doesn't really expect mutex initialization to fail.
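
To make the failed-initializer scenario concrete, here is a toy model of how an ignored initializer error could defeat a "did the initializer run" guard. It is an illustration under stated assumptions only: every name is hypothetical, none of this is taken from g_initialize.c or gssapi_krb5.c, and the finalizer counts bad deletes where the real code would hit the assertion.

```c
#include <assert.h>

enum { MODEL_KEY_COUNT = 3 };

static int key_registered[MODEL_KEY_COUNT];
static int init_considered_ok;  /* what the finalizer's guard consults */

/* Return the model to the "nothing has run yet" state. */
void model_reset(void)
{
    for (int k = 0; k < MODEL_KEY_COUNT; k++)
        key_registered[k] = 0;
    init_considered_ok = 0;
}

/* Register keys in order. fail_at simulates an error just before that key
 * is registered; pass -1 for a run with no failure. */
static int model_lib_init(int fail_at)
{
    assert(fail_at < MODEL_KEY_COUNT);
    for (int k = 0; k < MODEL_KEY_COUNT; k++) {
        if (k == fail_at)
            return -1;
        key_registered[k] = 1;
    }
    return 0;
}

/* Run the initializer. With ignore_error set, a failure is dropped, so the
 * guard later believes initialization succeeded (the buggy path). */
void model_run_init(int fail_at, int ignore_error)
{
    int err = model_lib_init(fail_at);
    init_considered_ok = (err == 0) || ignore_error;
}

/* Finalizer: skip cleanup unless the initializer is considered to have
 * succeeded, then delete every key. Returns how many deletes hit a key
 * that was never registered. */
int model_run_fini(void)
{
    int bad_deletes = 0;

    if (!init_considered_ok)
        return 0;
    for (int k = 0; k < MODEL_KEY_COUNT; k++) {
        if (!key_registered[k])
            bad_deletes++;
        else
            key_registered[k] = 0;
    }
    return bad_deletes;
}
```

With the error ignored, a failure before the second key leaves two keys unregistered and the finalizer deletes them anyway; with the error propagated, the guard skips the finalizer body and no bad delete occurs.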

I will fix the bug in gssint_mechglue_init(), but to diagnose the actual problem with any confidence, either I need to be able to reproduce the problem, or someone who reliably sees the problem needs to debug it in situ, likely by adding a bunch of instrumentation to the initializer and finalizer code.
 
Date: Fri, 10 Jan 2020 07:59:14 -0500
To: rt@krbdev.mit.edu
Subject: Re: [krbdev.mit.edu #8863] krb5int_key_delete: Assertion `destructors_set[keynum] == 1' failed.
From: "Spencer Malone" <malone.spencer@gmail.com>
Ok, I'm happy to take on trying to create a way to reliably reproduce the problem. Thanks for the help, will cycle back when I have something.

On Fri, Jan 10, 2020, 2:26 AM Greg Hudson via RT <rt@krbdev.mit.edu> wrote: