Date: Thu, 03 Jan 2013 21:28:50 -0500
From: Richard Basch <basch@alum.mit.edu>
Subject: krb5-1.11 iprop bug
To: krb5-bugs@mit.edu

The new iprop dump / restore code has a significant bug (my patches could not have contributed to this issue)…

 

From what I can discern, when a FULL RESYNC is required, the admin server checks whether a dump already exists with a serial number/timestamp that is in the ulog (so far so good).  However, I think the check must be flawed… I upgraded from an earlier version and still had slave_datatrans_* files from before, with older entries.  Furthermore, I had restarted the ulog (since the stock code doesn’t preserve the ulog, I have to assume I might have to promote a slave to a master, and that will force a re-init of the ulog).  So, in essence, even the last slave_datatrans file might have had a sno/timestamp, but it shouldn’t match anything in the ulog… a couple of updates come in, and now the serial numbers are “in range”.

 

Now, here’s where things go completely awry…

 

The slave got an old database copy, but the updates applied since were new.

 

I am not sure whether it picked up the updates from the older slave_datatrans_<hostname> files, or whether the problem was the re-init combined with an insufficient sno/timestamp check, but the result was an old database while the ulog, as reported after the transfer, carried the CURRENT sno/timestamp.

 

When I checked from_master on the load, it showed the new db sno… so I know the problem was in the dump/transfer path (a section of code I did NOT change with my patches).

 

I will try to delve into the problem further, but one should assume a slave will occasionally need to be promoted to a master, and that other slaves will be redirected to that master after having received updates from other sources, so this is a data integrity bug…  (I’ll send a patch if I figure out the cause, but all indications are that it is not related to my prior patches and is somehow related to the new conditional dump code; it certainly was a very lazy sno check.  At first glance I thought it would be OK, but perhaps it really needs to be a proper ulog check…)

Date: Thu, 03 Jan 2013 22:41:44 -0500
From: Richard Basch <basch@alum.mit.edu>
Subject: RE: krb5-1.11 iprop bug
To: krb5-bugs@mit.edu

I need to research further… from the code, this situation should not have been possible unless the “time jumped” or something happened in parallel that should not be possible under normal circumstances.

 

I am beginning to suspect I lost track of the state and was flipping states, having forgotten to quiesce a full propagation in progress…

 

Ignore this bug report until I can find the issue and/or reproduce the situation.

 

 


Date: Sat, 12 Jan 2013 07:58:40 -0500
From: Richard Basch <basch@alum.mit.edu>
Subject: RE: krb5-1.11 iprop bug
To: krb5-bugs@mit.edu

It is a REAL bug.  However, I did not understand the circumstances correctly.

 

If a full reload is required, the following sequence happens:

- The dump is transmitted from the master to the slave
- The ulog is initialized with the serial number information
- The database is loaded into temporary files with a ~ filename

Issue: if the database load does not complete, the ulog serial number has already been set, so the slave can run with an older database yet report that it is current.

 

 


Date: Tue, 22 Jan 2013 19:20:14 -0500
From: Richard Basch <basch@alum.mit.edu>
Subject: RE: [krbdev.mit.edu #7530] krb5-1.11 iprop bug
To: rt-comment@krbdev.mit.edu
Yes, I believe it is a pre-existing bug. I saw the problem also occur in
krb5-1.10.x, though I had not previously identified the cause.

I am not sure yet what caused the load to fail (only that it did, and I am not convinced any error messages were generated).


-----Original Message-----
From: Greg Hudson via RT [mailto:rt-comment@krbdev.mit.edu]
Sent: Tuesday, January 15, 2013 9:52 AM
To: basch@alum.mit.edu
Subject: [krbdev.mit.edu #7530] krb5-1.11 iprop bug

I see that dump.c's load_db() alters the iprop header before calling
restore_dump() and krb5_db_promote(). However, this also seems to be true
in all previous releases which include iprop. So, I think this is a pre-
existing bug?

Also, do you know what caused the load to fail?
Date: Sun, 03 Mar 2013 23:03:28 -0500
From: Richard Basch <basch@alum.mit.edu>
Subject: RE: krb5-1.11 iprop bug
To: krb5-bugs@mit.edu
CC: "'Richard Basch'" <basch@alum.mit.edu>

The ulog serial number is set before a reload operation begins; if the reload fails (or if the new database cannot be made active), the slave ends up with the updated serial number but a stale database that does not reflect the version recorded in the ulog.

 

I am currently testing the attached patch, which should resolve the issue:

https://github.com/rbasch/krb5/commit/2ef5ae0607d1c317a936e439b4be7a6f5184dc2f

 

Currently, this situation can lead to a data consistency issue with significant repercussions and, worse yet, one that is largely silent about the failure having occurred.

 

 


From: ghudson@mit.edu
Subject: git commit

Reset ulog header if iprop load fails

If an iprop slave tries to load a dump from the master and it fails,
reset the ulog header so we take another full dump, instead of
reporting that the slave is current when it isn't. Reported by
Richard Basch <basch@alum.mit.edu>.

https://github.com/krb5/krb5/commit/b1314b12b12e6cdbe80338010f265ccdaf359e4e
Author: Greg Hudson <ghudson@mit.edu>
Commit: b1314b12b12e6cdbe80338010f265ccdaf359e4e
Branch: master
src/kadmin/dbutil/dump.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
ulog_init_header isn't in 1.11, so use Richard's patch (I'm attaching it)
for backport.
commit 2ef5ae0607d1c317a936e439b4be7a6f5184dc2f
Author: rbasch <probe@tardis.internal.bright-prospects.com>
Date: Sun Mar 3 22:55:41 2013 -0500

Reset ulog if database load failed.
Avoids a slave reporting it is current when a full resync fails

diff --git a/src/kadmin/dbutil/dump.c b/src/kadmin/dbutil/dump.c
index c136ff3..099f521 100644
--- a/src/kadmin/dbutil/dump.c
+++ b/src/kadmin/dbutil/dump.c
@@ -2977,6 +2977,20 @@ error:
      */
     if (!(flags & FLAG_UPDATE)) {
         if (exit_status) {
+
+            /* Re-init ulog so we don't accidentally think we are current */
+            if (log_ctx && log_ctx->iproprole) {
+                log_ctx->ulog->kdb_last_sno = 0;
+                log_ctx->ulog->kdb_last_time.seconds = 0;
+                log_ctx->ulog->kdb_last_time.useconds = 0;
+
+                log_ctx->ulog->kdb_first_sno = 0;
+                log_ctx->ulog->kdb_first_time.seconds = 0;
+                log_ctx->ulog->kdb_first_time.useconds = 0;
+
+                ulog_sync_header(log_ctx->ulog);
+            }
+
             kret = krb5_db_destroy(kcontext, db5util_db_args);
             /*
              * Ignore a not supported error since there is nothing to do about
From: tlyu@mit.edu
Subject: git commit

Reset ulog if database load failed

If an iprop slave tries to load a dump from the master and it fails,
reset the ulog header so we take another full dump, instead of
reporting that the slave is current when it isn't.

[ghudson@mit.edu: commit message]

https://github.com/krb5/krb5/commit/74b3f961e15d2eee5bad93d2a224c10834bbaab8
Author: rbasch <probe@tardis.internal.bright-prospects.com>
Committer: Tom Yu <tlyu@mit.edu>
Commit: 74b3f961e15d2eee5bad93d2a224c10834bbaab8
Branch: krb5-1.11
src/kadmin/dbutil/dump.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)