From: Charles Hedrick <hedrick@rutgers.edu>
To: "krb5-bugs@mit.edu" <krb5-bugs@mit.edu>
Subject: missing primary cache after kdestroy
Date: Fri, 3 Mar 2017 21:58:41 +0000
Submitter-Id: hedrick
Originator: Charles Hedrick
Organization: Rutgers University
Confidential: no
Synopsis: primary cache sometimes missing
Severity: critical
Priority: medium
Release: 1.14
Environment: CentOS 7

[I should note that I haven’t actually seen NFS fail as described below. Rather, I’ve been testing the way credentials caches work carefully before deploying Kerberized NFS.]

In Kerberized NFS on CentOS, GSS uses the primary credentials cache. If that cache becomes unusable, jobs can lose access to NFS.

The get_primary function in cc_keyring.c is intended to set up a primary cache when cc_resolve is called to find the default cache. Unfortunately all it does is check whether the collection has designated a primary. It doesn’t check that the cache referred to actually exists and is usable. If you do a kdestroy of the primary cache (and I think also if the cache expires) the collection can have a primary pointing to a cache that no longer exists.

This means that even an auto renewal program such as k5start may not be good enough to make sure that a job always has NFS access. NFS doesn’t pay attention to KRB5CCNAME. It simply calls GSS with arguments that cause it to look for the default cache. If that cache doesn’t exist, GSS will fail, even though KRB5CCNAME for the job may point to a usable cache.

It may actually be safest not to use the keyring at all, but to use files in /tmp. gssd will try to use GSS to find a usable cache, but if it fails, it will search /tmp. Because /tmp doesn’t have a concept of primary, it checks all caches. That allows multiple sessions not to interfere with each other, and NFS to work if any session has a valid cache.

I’d like to see get_primary smartened so that it checks to see whether the primary cache is real, and if not, sets the primary to something usable. You could try to fix it by making the destroy operation remove the primary pointer if it’s destroying the primary. But that won’t handle expirations, which happen in the kernel without activating any code in the Kerberos library.
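
A minimal sketch of the check requested above, written in Python against a toy in-memory collection rather than the keyring code in cc_keyring.c. The `Cache` and `Collection` classes and both function names are hypothetical stand-ins invented for illustration; only the behavior (a primary pointer that can outlive its cache, and a lookup that repairs it) comes from the report.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Cache:
    """Toy stand-in for one credential cache in a collection."""
    principal: str
    expires_at: float  # Unix time at which the tickets stop being valid

    def usable(self, now: float) -> bool:
        return now < self.expires_at


@dataclass
class Collection:
    """Toy stand-in for a collection: named caches plus a primary pointer."""
    caches: Dict[str, Cache] = field(default_factory=dict)
    primary: Optional[str] = None

    def destroy(self, name: str) -> None:
        # As in the report: the cache goes away, the primary pointer stays.
        self.caches.pop(name, None)


def get_primary_as_reported(coll: Collection) -> Optional[str]:
    """Return the primary pointer without checking what it points to."""
    return coll.primary


def get_primary_checked(coll: Collection, now: float) -> Optional[str]:
    """Requested variant: repoint the primary if its target is gone or expired."""
    target = coll.caches.get(coll.primary) if coll.primary else None
    if target is not None and target.usable(now):
        return coll.primary
    for name, cache in coll.caches.items():
        if cache.usable(now):
            coll.primary = name  # the pointer moves to another usable cache
            return name
    return None  # nothing usable in the collection
```

After `coll.destroy(coll.primary)`, the first function still returns the stale name, while the second repoints the primary at the first usable cache it finds.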

From: Greg Hudson via RT <rt-comment@krbdev.mit.edu>

It's intentional that a collection might be non-empty but its primary
cache pointer might point to an empty or expired cache. Having the
primary pointer snap to an arbitrarily chosen cache in the collection
would be surprising, I think.

I agree that it might be better if gssd could know something about the
environment of the process invoking the filesystem operation, so that
cron jobs could use a cache that isn't shared with user login
sessions. But I don't see a good way to work around that limitation
within the krb5 library. gssd could search the default cache
collection for a usable cache in preference to searching files in
/tmp, but that's still not completely satisfying.

I believe that Red Hat is working on implementing a KCM server in sssd
to replace their use of the kernel keyring cache, but I don't know if
it will directly solve this issue because it still won't isolate a
long-running job from a short-term user login session from gssd's
point of view.
From: Charles Hedrick <hedrick@rutgers.edu>
To: "rt-comment@krbdev.mit.edu" <rt-comment@krbdev.mit.edu>
CC: Charles Hedrick <hedrick@rutgers.edu>
Subject: Re: [krbdev.mit.edu #8556] missing primary cache after kdestroy
Date: Sat, 4 Mar 2017 02:42:43 +0000
RT-Send-Cc:
rpc.gssd seems to interact better with /tmp. It looks for all files owned by the user, verifies that they have principals and credentials, and then picks the newest. It could certainly do the same thing with the KEYRING collection. I’m thinking that at the moment I probably want to use /tmp, either in place of KEYRING or possibly just as a place to make sure there’s at least one current credential file for each active user.
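
The selection rule described above (files owned by the user, checked for a principal and credentials, newest wins) can be sketched as follows. This is an illustration in Python, not rpc.gssd's code: `newest_cache_for` and `looks_usable` are hypothetical names, the `krb5cc_` prefix is assumed from the file names mentioned elsewhere in this thread, and the usability check is a placeholder that only tests for a non-empty file.

```python
import os
import stat
from pathlib import Path
from typing import Callable, Optional


def looks_usable(path: Path) -> bool:
    """Placeholder (assumption): a real check would parse the cache and
    confirm it holds a principal and unexpired credentials."""
    try:
        return path.stat().st_size > 0
    except OSError:
        return False


def newest_cache_for(uid: int,
                     directory: Path = Path("/tmp"),
                     prefix: str = "krb5cc_",
                     is_usable: Callable[[Path], bool] = looks_usable
                     ) -> Optional[Path]:
    """Return the most recently modified usable cache file owned by uid."""
    candidates = []
    for path in directory.glob(prefix + "*"):
        try:
            st = path.lstat()  # lstat: do not follow symlinks
        except OSError:
            continue  # file vanished between listing and stat
        if not stat.S_ISREG(st.st_mode) or st.st_uid != uid:
            continue
        if is_usable(path):
            candidates.append((st.st_mtime, path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

Calling `newest_cache_for(os.getuid())` returns a path or `None`; there is no primary pointer involved, which is why one session's kdestroy does not hide another session's cache.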

The issue isn’t just cron jobs. Depending upon how you log in (for example, whether you log in over Kerberized ssh), you can end up with your KRB5CCNAME set simply to KEYRING:persistent:UID. In effect that gives you the current primary cache for the collection. It’s quite possible that two sessions could have the same setting. (Indeed that seems to be the default when using sssd and sshd with default settings.) That means that if one session does a kdestroy (or if sshd_config is set to destroy the credentials), that can cause the other session to lose its credentials as well. It doesn’t help that there’s no way to configure krb5.conf with the random 6 digits at the end of the template (at least I couldn’t find it documented) and sshd doesn’t seem to have that functionality. So interaction between different ssh sessions seems quite likely.

There’s no magic solution to this. I’m probably going to disable the automatic destroy on logout in sshd_config, and take the view that if a user does kdestroy they deserve what they get. To avoid leaving credentials lying around too long, rather than having users do kdestroy on login, I’m going to default to a fairly short credential lifetime, with a daemon that automatically renews if there’s any process still active for the user.
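
One way such a daemon might decide whether a user is still active is to look for processes owned by that UID. The sketch below assumes a Linux-style /proc in which each numeric directory is owned by the user running that process, which is only an approximation; `user_has_processes` is a hypothetical helper, not part of any existing tool.

```python
from pathlib import Path


def user_has_processes(uid: int, proc_root: Path = Path("/proc")) -> bool:
    """Return True if any numeric entry under proc_root is owned by uid."""
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            if entry.stat().st_uid == uid:
                return True
        except OSError:
            continue  # the process exited while we were looking
    return False
```

A renewal loop could call this once per cycle and skip renewal for users with no remaining processes, letting their short-lived credentials expire.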

Cron is actually the least serious issue, since the cron job can easily use a separate credential file that won’t interfere with interactive use.

I think this is going to require a combination of getting users to follow specific practices, a daemon, and careful choice of parameters. But the combination of KEYRING design, sshd design, and rpc.gssd design don’t seem to fit well together.

> On Mar 3, 2017, at 5:18:55 PM, Greg Hudson via RT <rt-comment@krbdev.mit.edu> wrote:
>
> It's intentional that a collection might be non-empty but its primary
> cache pointer might point to an empty or expired cache. Having the
> primary pointer snap to an arbitrarily chosen cache in the collection
> would be surprising, I think.
>
> I agree that it might be better if gssd could know something about the
> environment of the process invoking the filesystem operation, so that
> cron jobs could use a cache that isn't shared with user login
> sessions. But I don't see a good way to work around that limitation
> within the krb5 library. gssd could search the default cache
> collection for a usable cache in preference to searching files in
> /tmp, but that's still not completely satisfying.
>
> I believe that Red Hat is working on implementing a KCM server in sssd
> to replace their use of the kernel keyring cache, but I don't know if
> it will directly solve this issue because it still won't isolate a
> long-running job from a short-term user login session from gssd's
> point of view.

From: Greg Hudson via RT <rt-comment@krbdev.mit.edu>

> It doesn’t help that there’s no way to
> configure krb5.conf with the random 6 digits at the end of the
> template (at least I couldn’t find it documented) and sshd
> doesn’t seem to have that functionality.

sshd could do this (and I think normally does, but it may be
configured not to in CentOS), but I don't think that would make sense
in krb5.conf. You don't want a random six digits every time the
default ccache is resolved; you would want the same six digits for
each login session, or perhaps some other associated group of
processes. We could add a path substitution for the value of
getsid(), perhaps, but I'm not sure whether or not that would wind up
being useful.
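
To make the idea concrete, here is a rough Python sketch of what such a substitution could look like. `%{uid}` is an existing krb5.conf parameter expansion token; `%{sid}` is hypothetical and does not exist in krb5, and `expand_ccache_template` is an illustrative name only.

```python
import os


def expand_ccache_template(template: str) -> str:
    """Expand %{uid} and a hypothetical %{sid} token in a ccache name."""
    values = {
        "uid": str(os.getuid()),
        # getsid(0) is the session ID of the calling process. Processes in
        # the same session share it (unless one of them calls setsid()), so
        # the expansion is the same each time the name is resolved.
        "sid": str(os.getsid(0)),
    }
    out = template
    for key, value in values.items():
        out = out.replace("%{" + key + "}", value)
    return out


if __name__ == "__main__":
    # Hypothetical template: one file cache per user and session.
    print(expand_ccache_template("FILE:/tmp/krb5cc_%{uid}_%{sid}"))
```

Unlike a random suffix, the session ID gives the same name every time the default cache is resolved within one session, which is the property asked for above.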

> I’m probably going to disable the automatic destroy on logout in
> sshd_config, and take the view that if a user does kdestroy they
> deserve what they get.

I believe that is Red Hat's general system design for using Kerberos--
that all login sessions use the same collection and you don't destroy
tickets on logout. It's a little different than the more traditional
design of using a separate cache for each login session and destroying
tickets at the end of the session.

> But the combination of KEYRING design, sshd design, and
> rpc.gssd design don’t seem to fit well together.

NFS and Kerberos have always been somewhat at odds because so little
process context is communicated from the filesystem-accessing process
to rpc.gssd. AFS has the concept of "process authentication groups"
with explicit credential management from userspace, which is sometimes
cumbersome but has predictable semantics. I'm not recommending using
AFS over NFS, but it provides one example of how a network filesystem
can be integrated with a userspace authentication system.

Credential cache collections (of which KEYRING is one implementation)
are not designed to address the NFS problem. Typically each cache
within a collection has a different client principal. If all login
sessions for a user share the same cache collection, they will also be
making use of the same cache for the username@defaultrealm client
principal, which means that kdestroy will destroy that one cache
containing those credentials. Any other caches in that collection
would be for different client principals and probably would not grant
the same access rights to the NFS server.
From: Charles Hedrick <hedrick@rutgers.edu>
To: "rt-comment@krbdev.mit.edu" <rt-comment@krbdev.mit.edu>
CC: Charles Hedrick <hedrick@rutgers.edu>
Subject: Re: [krbdev.mit.edu #8556] missing primary cache after kdestroy
Date: Sat, 4 Mar 2017 19:04:17 +0000
RT-Send-Cc:

I’ve redesigned my strategy to take all of this into account.

rpc.gssd uses the primary cache from KEYRING:persistent:NNN, and all caches in /tmp owned by the user.

As you point out, Red Hat by default shares the primary cache in KEYRING for all sessions.

I have modified renewd to renew just the primary cache in KEYRING, and any caches matching /tmp/krb5cc_NNN or /tmp/krb5cc_NNN_foo that are owned by the user.

To avoid race conditions, before renewing the first cache, it copies renewed credentials to /tmp/krb5cc_NNN-renew, which will also be seen by rpc.gssd. That way if rpc.gssd happens to look at a cache while it’s in an invalid state during renewal, the temporary cache will still let it proceed.
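
The fallback-before-renewal idea can be sketched as below. This is not renewd's code: the thread does not say exactly how the `-renew` file is filled, so the sketch simply copies the still-valid cache before touching it. `renew_with_fallback` and the `renew` callback are hypothetical, and ownership handling (a daemon running as root would need to chown the copy to the user) is omitted.

```python
import os
import shutil
from pathlib import Path
from typing import Callable


def renew_with_fallback(cache: Path, renew: Callable[[Path], None]) -> Path:
    """Copy cache to a '-renew' sibling, then renew the original in place."""
    fallback = cache.with_name(cache.name + "-renew")
    # Write under a dot-prefixed scratch name first, then rename, so that no
    # half-written file ever appears under a krb5cc_ name.
    scratch = cache.with_name("." + fallback.name + ".tmp")
    shutil.copy2(cache, scratch)   # copies contents, mode and timestamps
    os.replace(scratch, fallback)  # atomic rename within one filesystem
    # From here on a valid fallback exists, whatever state `cache` is in.
    renew(cache)
    return fallback
```

While `renew` is running, a reader that scans the directory finds at least one complete cache: the fallback created just before.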

I think this will support the common cases. It’s the best I can think to do. My primary goal is to make NFS just work for users who simply log in in the default ways. Even for cron jobs they’ll probably have to follow instructions, though at the moment I think those instructions will just be to register with credserv for that system, and start their job with kgetcred (which pulls credentials from a server running credserv, thus avoiding the need to have a keytab where root could potentially steal it and use it anywhere).

> On Mar 4, 2017, at 12:41:28 AM, Greg Hudson via RT <rt-comment@krbdev.mit.edu> wrote:
>
>> It doesn’t help that there’s no way to
>> configure krb5.conf with the random 6 digits at the end of the
>> template (at least I couldn’t find it documented) and sshd
>> doesn’t seem to have that functionality.
>
> sshd could do this (and I think normally does, but it may be
> configured not to in CentOS), but I don't think that would make sense
> in krb5.conf. You don't want a random six digits every time the
> default ccache is resolved; you would want the same six digits for
> each login session, or perhaps some other associated group of
> processes. We could add a path substitution for the value of
> getsid(), perhaps, but I'm not sure whether or not that would wind up
> being useful.
>
>> I’m probably going to disable the automatic destroy on logout in
>> sshd_config, and take the view that if a user does kdestroy they
>> deserve what they get.
>
> I believe that is Red Hat's general system design for using Kerberos--
> that all login sessions use the same collection and you don't destroy
> tickets on logout. It's a little different than the more traditional
> design of using a separate cache for each login session and destroying
> tickets at the end of the session.
>
>> But the combination of KEYRING design, sshd design, and
>> rpc.gssd design don’t seem to fit well together.
>
> NFS and Kerberos have always been somewhat at odds because so little
> process context is communicated from the filesystem-accessing process
> to rpc.gssd. AFS has the concept of "process authentication groups"
> with explicit credential management from userspace, which is sometimes
> cumbersome but has predictable semantics. I'm not recommending using
> AFS over NFS, but it provides one example of how a network filesystem
> can be integrated with a userspace authentication system.
>
> Credential cache collections (of which KEYRING is one implementation)
> are not designed to address the NFS problem. Typically each cache
> within a collection has a different client principal. If all login
> sessions for a user share the same cache collection, they will also be
> making use of the same cache for the username@defaultrealm client
> principal, which means that kdestroy will destroy that one cache
> containing those credentials. Any other caches in that collection
> would be for different client principals and probably would not grant
> the same access rights to the NFS server.