The ulog serial number is set before a reload operation begins; if the reload fails (or if the new database cannot be made active), the slave ends up with the updated serial number but a stale database that does not reflect the version recorded in the ulog.
I am currently testing the attached patch, which should resolve the issue:
Currently, this situation can lead to a data consistency issue with significant repercussions and, worse yet, the failure is largely silent.
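To make the ordering concern concrete, here is a minimal sketch in Python. It is not the actual ulog/kdb code: `Replica`, `load_dump`, and both `resync_*` functions are hypothetical names. It only contrasts recording the serial before the load with recording it after the new database is active.

```python
class Replica:
    """Toy stand-in for a slave: a database plus the serial recorded in its ulog."""

    def __init__(self, db, serial):
        self.db = db
        self.serial = serial


def load_dump(dump):
    """Hypothetical loader: returns the new database, or raises on a bad dump."""
    if dump is None:
        raise RuntimeError("reload failed")
    return dict(dump)


def resync_serial_first(replica, dump, new_serial):
    # Ordering described in the report: the serial is recorded before the load.
    replica.serial = new_serial
    replica.db = load_dump(dump)  # if this raises, the serial is already advanced


def resync_load_first(replica, dump, new_serial):
    # Alternative ordering: record the serial only once the new database is in place.
    new_db = load_dump(dump)
    replica.db = new_db
    replica.serial = new_serial


def try_resync(resync, replica, dump, new_serial):
    """Run one resync attempt and report whether it succeeded."""
    try:
        resync(replica, dump, new_serial)
        return True
    except RuntimeError:
        return False
```

With a failing load, `resync_serial_first` leaves the replica holding the new serial alongside the old database, while `resync_load_first` leaves both untouched so the next poll would still see that a resync is needed.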
I need to research further… from the code, this situation should not have been possible unless the “time jumped” or something happened in parallel, which should not occur under normal circumstances.
I am beginning to suspect I lost track of the state: I was flipping states but had forgotten to quiesce a full propagation in progress…
Ignore this bug report until I can find the issue and/or reproduce the situation.
The new iprop dump / restore code has a significant bug (my patches could not have contributed to this issue)…
From what I can discern, when a FULL RESYNC is required, the admin server checks whether a dump already exists with the serial/timestamp in the ulog (so far so good). However, I think the check must be flawed… I upgraded from an earlier version and still had slave_datatrans_* files from before, with older entries. Furthermore, I had restarted the ulog (since the stock code doesn’t preserve the ulog, I have to assume I might need to promote a slave to a master, and that will force a re-init of the ulog). So, in essence, the last slave_datatrans file may well have had a sno/timestamp, but it shouldn’t match anything in the ulog… then a couple of updates come in, and now the serial numbers are “in range”.
Now, here’s where things go completely awry…
The slave got an old database copy, but the updates applied since were new.
I am not sure if it picked up the updates from the older slave_datatrans_<hostname> files or if the problem was the reinit and the sno/timestamp check not being sufficient, but the result was an old database, while the ulog state reported after the transfer carried the CURRENT sno/timestamp.
When I checked from_master on the load, it looked like the new db sno… so I know the problem was with the dump/transfer (a section of code I did NOT change with my patches).
I will try to delve into the problem further, but one should assume a slave will need to be promoted to a master on occasion, and that other slaves will be redirected to that master after having received updates from other sources, so this is a data integrity bug… (I’ll send a patch if I figure out the cause, but all indications are that it is not related to my prior patches but to the new conditional dump code; it certainly was a very lazy sno check, though at first glance I thought it would be OK, but perhaps it really needs to be a proper ulog check…)
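As a rough illustration of why an "in range" sno test can be fooled after a ulog re-init, here is a small Python sketch. It is an assumption about the failure mode, not the actual conditional dump code: `Stamp`, `dump_in_range`, and `dump_matches_ulog` are hypothetical names, and the ulog is modelled as a plain list of (sno, timestamp) pairs.

```python
from collections import namedtuple

# Hypothetical (sno, timestamp) pair, as stored with a dump or a ulog entry.
Stamp = namedtuple("Stamp", ["sno", "timestamp"])


def dump_in_range(ulog, dump):
    """Range-only test: accepts any dump whose sno lies within the ulog's span."""
    if not ulog:
        return False
    return ulog[0].sno <= dump.sno <= ulog[-1].sno


def dump_matches_ulog(ulog, dump):
    """Stricter test: the ulog must hold an entry with the same sno AND timestamp."""
    return any(e.sno == dump.sno and e.timestamp == dump.timestamp for e in ulog)


# A dump left over from before the ulog was re-initialised.
stale_dump = Stamp(sno=2, timestamp=1000)

# The re-initialised ulog restarts its serial numbers; a few updates have arrived since.
ulog = [Stamp(1, 5000), Stamp(2, 5010), Stamp(3, 5020)]
```

Here `dump_in_range(ulog, stale_dump)` accepts the old dump because its sno happens to fall inside the new ulog's span, whereas `dump_matches_ulog(ulog, stale_dump)` rejects it because no entry carries that timestamp.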