It is a REAL bug. However, I did not understand the circumstances correctly.
If a full reload is required, the following sequence happens:
- The dump is transmitted from the master to the slave
- The ulog is initialized with the serial number information
- The database is loaded into temporary files with a ~ filename
Issue: if the database load does not complete, the ulog serial number has already been set, so the slave can run with an older database yet report that it is current.
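The ordering problem described above can be modelled in a few lines. This is only a sketch, not the actual krb5 code: the `Slave` class, `load_database`, and both `reload_*` functions are hypothetical names made up for illustration, and the "fix" shown (recording the serial number only after a successful load) is one possible approach rather than a tested patch.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    db_sno: int    # serial number the loaded database actually reflects
    ulog_sno: int  # serial number the ulog header advertises


class LoadError(Exception):
    """Raised when the database load does not complete."""


def load_database(slave: Slave, dump_sno: int, fail: bool) -> None:
    # Hypothetical stand-in for loading the dump into temporary files
    # and swapping them into place.
    if fail:
        raise LoadError("load into temporary files did not complete")
    slave.db_sno = dump_sno


def reload_header_first(slave: Slave, dump_sno: int, fail: bool = False) -> None:
    # Problematic order: the ulog serial is set before the load.
    slave.ulog_sno = dump_sno
    try:
        load_database(slave, dump_sno, fail)
    except LoadError:
        pass  # old database stays in place, but the ulog claims dump_sno


def reload_header_last(slave: Slave, dump_sno: int, fail: bool = False) -> bool:
    # Safer order (an assumption, not the project's fix): record the
    # serial only once the load has succeeded.
    try:
        load_database(slave, dump_sno, fail)
    except LoadError:
        return False
    slave.ulog_sno = dump_sno
    return True
```

With a failed load, `reload_header_first` leaves `db_sno` at its old value while `ulog_sno` already holds the new one, which is the "older database yet reports current" state. `reload_header_last` leaves both values untouched on failure.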
I need to research further… from reading the code, this situation should not have been possible unless the “time jumped” or something ran in parallel, and neither should happen under normal circumstances.
I am beginning to suspect that I lost track of the state: I was flipping states and had forgotten to quiesce a full propagation that was still in progress…
Ignore this bug report until I can find the issue and/or reproduce the situation.
The new iprop dump / restore code has a significant bug (my patches could not have contributed to this issue)…
From what I can discern, when a FULL RESYNC is required, the admin server checks whether a dump already exists whose serial/timestamp appears in the ulog (so far so good). However, I think the check must be flawed… I upgraded from an earlier version and still had slave_datatrans_* files from before, with older entries. Furthermore, I had restarted the ulog (the stock code doesn’t preserve the ulog, so I have to assume a slave may need to be promoted to a master, which forces a re-init of the ulog). So, in essence, even if the last slave_datatrans file had a sno/timestamp, it shouldn’t match anything in the new ulog… but a couple of updates come in, and now its serial number is “in range”.
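To illustrate how a stale dump could pass a serial-number check after a ulog re-init, here is a small sketch. It assumes (and this is only an assumption about the real code) that the existing check amounts to "is the dump's sno within the ulog's current range". The `Entry` type and both functions are hypothetical names; the stricter variant requires an exact sno and timestamp match against a ulog entry.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    sno: int        # update serial number
    timestamp: int  # time the update was logged (seconds)


def dump_in_range(dump: Entry, ulog: List[Entry]) -> bool:
    # Lazy check (assumed): only asks whether the serial number falls
    # between the first and last serial currently in the ulog.
    return bool(ulog) and ulog[0].sno <= dump.sno <= ulog[-1].sno


def dump_matches_ulog(dump: Entry, ulog: List[Entry]) -> bool:
    # Stricter check: the dump must correspond to an actual ulog entry,
    # matching both serial number and timestamp.
    return any(e.sno == dump.sno and e.timestamp == dump.timestamp
               for e in ulog)


# A dump left over from before the ulog was re-initialized...
stale_dump = Entry(sno=2, timestamp=1000)
# ...and a fresh ulog that has since received a few updates.
new_ulog = [Entry(1, 5000), Entry(2, 5010), Entry(3, 5020)]

lazy_result = dump_in_range(stale_dump, new_ulog)
strict_result = dump_matches_ulog(stale_dump, new_ulog)
```

In this made-up scenario the lazy check accepts the stale dump because serial 2 is back "in range", while the strict check rejects it because the timestamps differ.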
Now, here’s where things go completely awry…
The slave got an old database copy, but the updates applied since were new.
I am not sure whether it picked up the updates from the older slave_datatrans_<hostname> files, or whether the problem was the reinit combined with an insufficient sno/timestamp check. Either way, the result was an old database, while the ulog reported after the transfer carried the CURRENT sno/timestamp.
When I checked from_master on the load, it looked like the new db sno… so I know the problem was with the dump/transfer (a section of code I did NOT change with my patches).
I will try to delve into the problem further. One should assume that a slave will occasionally need to be promoted to a master, and that other slaves, having already received updates from other sources, will then be redirected to that master, so this is a data integrity bug… I’ll send a patch if I figure out the cause. All indications are that it is not related to my prior patches but to the new conditional dump code. The sno check there is certainly very lazy; at first glance I thought it would be OK, but perhaps it really needs to be a proper ulog check…