Subject: | unchecked calls to k5_mutex_lock() interact poorly with finalizers |
We need to ensure that all calls to k5_mutex_lock() check the return
values, to avoid assertion failures when the lock attempt fails and
k5_mutex_unlock() then does its consistency checks. This can happen if
finalizers destroy the mutex before the call to k5_mutex_lock(). It is
also not safe to operate on shared data if k5_mutex_lock() fails, so we
should do something reasonable in that situation anyway.
From: Alexandra Ellwood <lxs@MIT.EDU>
Date: Thu, 15 May 2008 17:15:12 -0400
I think I have figured it out. Sadly it's hard to fix completely, but
there is a relatively easy fix which should reduce the number of these
reports.
So the key that tipped me off was looking at the full crash log.
Obviously there's our poor crashed thread which is calling assert
because it doesn't actually own the lock it just locked. Or, to be
precise, pthread_equal((&(&g_mechListLock)->os)->owner,
pthread_self()) is returning false:
Thread 16 Crashed:
0 libSystem.B.dylib 0x92e8e17a
usleep$NOCANCEL$UNIX2003 + 0
1 libSystem.B.dylib 0x92eaf9bb abort + 85
2 libSystem.B.dylib 0x92ea1593 __assert_rtn + 101
3 edu.mit.Kerberos 0x96bdb4d4 gssint_get_mechanism + 323
4 edu.mit.Kerberos 0x96b60508 gss_init_sec_context + 133
5 sncgss.dyld 0x0592b680
sapgss_init_sec_context + 779
6 libJPlatin.jnilib 0x12eed293 SncPEstablishContext + 321
7 libJPlatin.jnilib 0x12ef06aa SncPFrameIn + 7632
8 libJPlatin.jnilib 0x12ee426b SncProcessInput + 921
9 libJPlatin.jnilib 0x12edcd30 TmRead + 2321
10 libJPlatin.jnilib 0x12e5a742
AgCom::Read(unsigned char*, int, int*) + 154
11 libJPlatin.jnilib 0x12e5a890
AgCom::CsRead(unsigned char*, int, int*, int) + 146
12 libJPlatin.jnilib 0x12d9e73e
AgComAdaptor::read(unsigned char**, int*) + 196
13 libJPlatin.jnilib 0x12d9b674
Java_com_sap_platin_r3_protocol_ni_JniAgComAdaptor_jniRead + 58
14 ??? 0x035859b1 0 + 56121777
15 ??? 0x0357fa54 0 + 56097364
16 ??? 0x0357faaa 0 + 56097450
17 ??? 0x0357d227 0 + 56087079
18 libjvm.dylib 0x004d263a 0x3b9000 + 1152570
19 libjvm.dylib 0x00416763 0x3b9000 + 382819
20 libjvm.dylib 0x00416655 0x3b9000 + 382549
21 libjvm.dylib 0x00416595 0x3b9000 + 382357
22 libjvm.dylib 0x004164b8 0x3b9000 + 382136
23 libjvm.dylib 0x0067dcc5 JVM_RaiseSignal + 441189
24 libSystem.B.dylib 0x92e3753b _pthread_body + 27
However, scrolling up a bit we find this thread:
Thread 3:
0 libSystem.B.dylib 0x92dba9e6 mach_msg_trap + 10
1 libSystem.B.dylib 0x92dc21dc mach_msg + 72
2 libSystem.B.dylib 0x92dc1fe0 vproc_mig_look_up2 + 198
3 libSystem.B.dylib 0x92dc1ea0 bootstrap_look_up2 + 73
4 libSystem.B.dylib 0x92dc1e55 bootstrap_look_up + 55
5 edu.mit.Kerberos 0x96b6ffd4
kipc_client_lookup_server + 327
6 edu.mit.Kerberos 0x96b693ed cci_os_ipc + 264
7 edu.mit.Kerberos 0x96b691ba _cci_ipc_send + 221
8 edu.mit.Kerberos 0x96b692bf
cci_ipc_send_no_launch + 39
9 edu.mit.Kerberos 0x96b65b39 cci_context_sync + 181
10 edu.mit.Kerberos 0x96b66bce ccapi_context_release + 33
11 edu.mit.Kerberos 0x96b62a07 krb5_stdcc_shutdown + 30
12 edu.mit.Kerberos 0x96b4eb51 krb5int_lib_fini + 267
13 dyld 0x8fe12fc3
ImageLoaderMachO::doTermination(ImageLoader::LinkContext const&) + 227
14 dyld 0x8fe04f7b
dyld::runTerminators(void*) + 155
15 libSystem.B.dylib 0x92de353c __cxa_finalize + 241
16 libSystem.B.dylib 0x92de3430 exit + 33
17 libjvm.dylib 0x00473ba3 0x3b9000 + 764835
18 libjvm.dylib 0x00473b02 0x3b9000 + 764674
19 libjvm.dylib 0x004739c8 0x3b9000 + 764360
20 libjvm.dylib 0x0041603d 0x3b9000 + 380989
21 libjvm.dylib 0x00415d57 0x3b9000 + 380247
22 libjvm.dylib 0x0067dcc5 JVM_RaiseSignal + 441189
23 libSystem.B.dylib 0x92e3753b _pthread_body + 27
Uh oh. SAPGUI is in the middle of exiting and is running the library
finalizer for the krb5 library. Unfortunately this probably isn't the
only library finalizer it has run. Chances are good that this thread
has already run the finalizer for the GSS library...
After talking to Ken Raeburn, here's what we think is happening:
Thread 16 enters gssint_get_mechanism() and calls
gssint_initialize_library(). This function verifies the library is
loaded and g_mechListLock is valid.
Meanwhile the application is trying to quit due to some error or user
action. Thread 3 calls exit() which calls dyld::runTerminators().
This runs the GSS library finalizer which calls
pthread_mutex_destroy() on g_mechListLock, invalidating the lock which
thread 16 is about to use. Thread 3 then proceeds on to the krb5
library finalizer which blocks on a Mach-IPC call to the CCacheServer.
Meanwhile, thread 16 calls calls k5_mutex_lock() which returns an
error because g_mechListLock has been destroyed by thread 3.
Unfortunately, gssint_get_mechanism() discards the error from
k5_mutex_lock() and continues. When it calls k5_mutex_unlock() the
owner is not set (because g_mechListLock was not locked) and thus we
get an assertion failure.
Now this particular problem can be easily fixed by making sure our
code checks the return value of k5_mutex_lock() everywhere it is
called and does not call k5_mutex_unlock() if k5_mutex_lock() fails.
After making this change there are probably still similar race
conditions which need to be analyzed. I want to look at all of the
crash logs to make sure there are not other common scenarios which
cause similar race conditions.
Note that just ignoring the already-destroyed mutex isn't really safe
because we could end up garbling memory and potentially spewing it out
on the network before the application manages to exit. So just
removing the assert() calls isn't really a good idea. Application
crash on exit is bad, but better than a security hole. :-)
--lxs
Alexandra Ellwood <lxs@mit.edu>
MIT Kerberos Consortium
<http://mit.edu/lxs/www>
values, to avoid assertion failures when the lock attempt fails and
k5_mutex_unlock() then does its consistency checks. This can happen if
finalizers destroy the mutex before the call to k5_mutex_lock(). It is
also not safe to operate on shared data if k5_mutex_lock() fails, so we
should do something reasonable in that situation anyway.
From: Alexandra Ellwood <lxs@MIT.EDU>
Date: Thu, 15 May 2008 17:15:12 -0400
I think I have figured it out. Sadly it's hard to fix completely, but
there is a relatively easy fix which should reduce the number of these
reports.
So the key that tipped me off was looking at the full crash log.
Obviously there's our poor crashed thread which is calling assert
because it doesn't actually own the lock it just locked. Or, to be
precise, pthread_equal((&(&g_mechListLock)->os)->owner,
pthread_self()) is returning false:
Thread 16 Crashed:
0 libSystem.B.dylib 0x92e8e17a
usleep$NOCANCEL$UNIX2003 + 0
1 libSystem.B.dylib 0x92eaf9bb abort + 85
2 libSystem.B.dylib 0x92ea1593 __assert_rtn + 101
3 edu.mit.Kerberos 0x96bdb4d4 gssint_get_mechanism + 323
4 edu.mit.Kerberos 0x96b60508 gss_init_sec_context + 133
5 sncgss.dyld 0x0592b680
sapgss_init_sec_context + 779
6 libJPlatin.jnilib 0x12eed293 SncPEstablishContext + 321
7 libJPlatin.jnilib 0x12ef06aa SncPFrameIn + 7632
8 libJPlatin.jnilib 0x12ee426b SncProcessInput + 921
9 libJPlatin.jnilib 0x12edcd30 TmRead + 2321
10 libJPlatin.jnilib 0x12e5a742
AgCom::Read(unsigned char*, int, int*) + 154
11 libJPlatin.jnilib 0x12e5a890
AgCom::CsRead(unsigned char*, int, int*, int) + 146
12 libJPlatin.jnilib 0x12d9e73e
AgComAdaptor::read(unsigned char**, int*) + 196
13 libJPlatin.jnilib 0x12d9b674
Java_com_sap_platin_r3_protocol_ni_JniAgComAdaptor_jniRead + 58
14 ??? 0x035859b1 0 + 56121777
15 ??? 0x0357fa54 0 + 56097364
16 ??? 0x0357faaa 0 + 56097450
17 ??? 0x0357d227 0 + 56087079
18 libjvm.dylib 0x004d263a 0x3b9000 + 1152570
19 libjvm.dylib 0x00416763 0x3b9000 + 382819
20 libjvm.dylib 0x00416655 0x3b9000 + 382549
21 libjvm.dylib 0x00416595 0x3b9000 + 382357
22 libjvm.dylib 0x004164b8 0x3b9000 + 382136
23 libjvm.dylib 0x0067dcc5 JVM_RaiseSignal + 441189
24 libSystem.B.dylib 0x92e3753b _pthread_body + 27
However, scrolling up a bit we find this thread:
Thread 3:
0 libSystem.B.dylib 0x92dba9e6 mach_msg_trap + 10
1 libSystem.B.dylib 0x92dc21dc mach_msg + 72
2 libSystem.B.dylib 0x92dc1fe0 vproc_mig_look_up2 + 198
3 libSystem.B.dylib 0x92dc1ea0 bootstrap_look_up2 + 73
4 libSystem.B.dylib 0x92dc1e55 bootstrap_look_up + 55
5 edu.mit.Kerberos 0x96b6ffd4
kipc_client_lookup_server + 327
6 edu.mit.Kerberos 0x96b693ed cci_os_ipc + 264
7 edu.mit.Kerberos 0x96b691ba _cci_ipc_send + 221
8 edu.mit.Kerberos 0x96b692bf
cci_ipc_send_no_launch + 39
9 edu.mit.Kerberos 0x96b65b39 cci_context_sync + 181
10 edu.mit.Kerberos 0x96b66bce ccapi_context_release + 33
11 edu.mit.Kerberos 0x96b62a07 krb5_stdcc_shutdown + 30
12 edu.mit.Kerberos 0x96b4eb51 krb5int_lib_fini + 267
13 dyld 0x8fe12fc3
ImageLoaderMachO::doTermination(ImageLoader::LinkContext const&) + 227
14 dyld 0x8fe04f7b
dyld::runTerminators(void*) + 155
15 libSystem.B.dylib 0x92de353c __cxa_finalize + 241
16 libSystem.B.dylib 0x92de3430 exit + 33
17 libjvm.dylib 0x00473ba3 0x3b9000 + 764835
18 libjvm.dylib 0x00473b02 0x3b9000 + 764674
19 libjvm.dylib 0x004739c8 0x3b9000 + 764360
20 libjvm.dylib 0x0041603d 0x3b9000 + 380989
21 libjvm.dylib 0x00415d57 0x3b9000 + 380247
22 libjvm.dylib 0x0067dcc5 JVM_RaiseSignal + 441189
23 libSystem.B.dylib 0x92e3753b _pthread_body + 27
Uh oh. SAPGUI is in the middle of exiting and is running the library
finalizer for the krb5 library. Unfortunately this probably isn't the
only library finalizer it has run. Chances are good that this thread
has already run the finalizer for the GSS library...
After talking to Ken Raeburn, here's what we think is happening:
Thread 16 enters gssint_get_mechanism() and calls
gssint_initialize_library(). This function verifies the library is
loaded and g_mechListLock is valid.
Meanwhile the application is trying to quit due to some error or user
action. Thread 3 calls exit() which calls dyld::runTerminators().
This runs the GSS library finalizer which calls
pthread_mutex_destroy() on g_mechListLock, invalidating the lock which
thread 16 is about to use. Thread 3 then proceeds on to the krb5
library finalizer which blocks on a Mach-IPC call to the CCacheServer.
Meanwhile, thread 16 calls calls k5_mutex_lock() which returns an
error because g_mechListLock has been destroyed by thread 3.
Unfortunately, gssint_get_mechanism() discards the error from
k5_mutex_lock() and continues. When it calls k5_mutex_unlock() the
owner is not set (because g_mechListLock was not locked) and thus we
get an assertion failure.
Now this particular problem can be easily fixed by making sure our
code checks the return value of k5_mutex_lock() everywhere it is
called and does not call k5_mutex_unlock() if k5_mutex_lock() fails.
After making this change there are probably still similar race
conditions which need to be analyzed. I want to look at all of the
crash logs to make sure there are not other common scenarios which
cause similar race conditions.
Note that just ignoring the already-destroyed mutex isn't really safe
because we could end up garbling memory and potentially spewing it out
on the network before the application manages to exit. So just
removing the assert() calls isn't really a good idea. Application
crash on exit is bad, but better than a security hole. :-)
--lxs
Alexandra Ellwood <lxs@mit.edu>
MIT Kerberos Consortium
<http://mit.edu/lxs/www>