Ensure logical consistency between primary and secondary instances after a switchover following kill -9 of processes on secondary
Final Release Note
MUPIP JOURNAL -ROLLBACK -BACKWARD -FETCHRESYNC keeps user data in sync between the primary and secondary sides of a replication connection when processes on the secondary are terminated with kill -9 followed by a switchover. Previously, this could cause primary and secondary database instances to have different content. Note that YottaDB strongly recommends against the use of kill -9. [#362 (closed)]
Description
When we ran the recently enabled manually_start/dual_fail2_no_ipcrm2 subtest, we noticed a failure only on the ARM platform with the following diff.
> cat dual_fail2_no_ipcrm2.diff
53c53
< DATABASE EXTRACT PASSED
---
> TEST-E-FAILED: DATABASE EXTRACTs on PRIMARY and SECONDARY are DIFFERENT. Check RF_EXTR_debuglog.out for details
The databases on the primary and secondary sides were different at the end of the test. Below is the actual data diff.
> cat pri_21_51_23.glo_sec_21_51_23.glo_spanreg_glodiff
7679c7679
< ^dntp(0,47902994)="somejunk"
---
> ^dntp(0,47902994)="-254.34"
This turned out to be a portable issue in YottaDB, i.e., it is an issue even on Linux/x86_64 builds. Below are the details of the analysis.
The journal extracts on the primary and secondary sides show the following records corresponding to the data that differs.
Primary
20740 0x002ae6a8 [0x0058] :: TSTART \64912,78256\36354\3272188576\32446\0\4147\0\0
20741 TSET \64912,78256\36354\3272188576\32446\0\4147\0\0\4\1\^dntp(0,47902994)="somejunk"
20742 0x002ae700 [0x0038] :: TCOM \64912,78256\36354\3699605187\32446\0\4147\0\0\4\BA
20743 0x002ae738 [0x0050] :: SET \64912,78256\36355\1971748481\32446\0\4148\0\0\0\0\^dntp(0,47902994)="-254.34"
Secondary
21485 0x002bfe80 [0x0058] :: TSTART \64912,78293\36354\2396333726\32276\0\4147\0\0
21486 TSET \64912,78293\36354\2396333726\32276\0\4147\0\0\4\1\^dntp(0,47902994)="somejunk"
21487 0x002bfed8 [0x0038] :: TCOM \64912,78293\36354\2523728995\32276\0\4147\0\0\4\BA
21488 0x002bff10 [0x0030] :: NULL \64912,78293\36355\2306412254\32276\0\4148\0\0
21489 0x002bff40 [0x0020] :: ALIGN \64912,78293\0\3780385109\32264\0
Notice how there were two SETs of ^dntp on the primary side, but only one SET on the secondary side. The second SET on the primary side, which happened at jnlseqno 4148, appears on the secondary as a NULL record. This is possible because the test crashed the secondary side: the update process was killed using kill -9, but the database shared memory was not removed (as it would have been in a system crash). Note that since GT.M V6.3-002 (and YottaDB r1.10), as part of GTM-8436 (http://tinco.pair.com/bhaskar/gtm/doc/articles/GTM_V6.3-002_Release_Notes.html#GTM-8436), journal records are committed in two phases. If the update process was in the middle of committing seqno 4148 when it was killed, it is possible it had reserved space in the journal buffers (done in phase 1 of the commit, while holding the db critical section lock) but was killed before it wrote the journal records in phase 2 of the commit (when it does not hold the db critical section lock).
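The two-phase commit window described above can be sketched as follows. This is an illustrative model, not YottaDB source; the class and method names (JournalBuffer, phase1_reserve, phase2_write) are invented for the example. The point is that phase 1 only claims space, so a kill -9 between the phases leaves a reserved slot with no record content.

```python
# Hypothetical sketch of two-phase journal commit (names are invented):
# phase 1 reserves space in the journal buffer while holding the critical
# section lock; phase 2 fills in the record content without the lock.

class JournalBuffer:
    def __init__(self):
        self.slots = []        # each slot: [offset, length, content-or-None]
        self.next_offset = 0

    def phase1_reserve(self, length):
        """Phase 1: runs under the db critical section lock; only claims space."""
        slot = len(self.slots)
        self.slots.append([self.next_offset, length, None])
        self.next_offset += length
        return slot

    def phase2_write(self, slot, content):
        """Phase 2: runs outside the lock; writes the actual journal record."""
        self.slots[slot][2] = content

buf = JournalBuffer()
slot = buf.phase1_reserve(0x50)       # space for the seqno 4148 SET reserved...
# ... kill -9 lands here: phase 2 never runs for this slot ...
offset, length, content = buf.slots[slot]
print(content is None)                # True: space reserved, record missing
```

A later process scanning the buffer sees the reserved length but has no way to reconstruct the missing content, which is exactly the state rollback encounters below.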
In this state, when the test later ran MUPIP JOURNAL -ROLLBACK -BACKWARD on the secondary, rollback found the database shared memory still lying around and so tried to flush whatever it could to disk before starting. It found the reserved space for the SET record in the journal buffer section but had no idea of the actual journal record content (since this is a different process than the update process), so it filled that space with a NULL/ALIGN journal record combination (the 0x50 bytes of space for the SET record were split into a 0x30 NULL record and a 0x20 ALIGN record). After running down the database shared memory in this fashion, it proceeded with the rollback, which played forward seqno 4148 as a NULL journal record, effectively creating a difference in data between the primary and secondary.
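The NULL/ALIGN split can be shown as simple arithmetic. This is a hedged sketch under the assumption of a fixed NULL record size matching the extract above (0x30); the function name and record tuples are invented, and real journal record layouts are more involved.

```python
# Hypothetical sketch of padding a reserved-but-unwritten journal slot:
# a fixed-size NULL record first, then an ALIGN record covering the rest.
# Sizes mirror the extract above: 0x50 = 0x30 (NULL) + 0x20 (ALIGN).

NULL_REC_SIZE = 0x30   # assumed NULL record size for this example

def fill_reserved_space(total):
    """Return (type, size) filler records covering `total` reserved bytes."""
    if total < NULL_REC_SIZE:
        raise ValueError("reserved space too small for a NULL record")
    records = [("NULL", NULL_REC_SIZE)]
    remainder = total - NULL_REC_SIZE
    if remainder:
        records.append(("ALIGN", remainder))
    return records

print(fill_reserved_space(0x50))   # [('NULL', 48), ('ALIGN', 32)]
```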
Although kill -9 is not supported, rollback should have handled this better by recognizing that this NULL record was not a user-initiated NULL record but a filler record, identifying it as a lost transaction, and playing it not into the database but into the lost transaction file. That would have avoided the data discrepancy between the primary and secondary while still giving the user the option to see all the data considered lost and replay whatever they choose.
The proposed fix is to record in the NULL and INCTN records (used in replicated and non-replicated environments respectively to fill up reserved space) the fact that they are filler records, so that any journal recovery/rollback can safely treat them as lost transactions and avoid the data discrepancy altogether.
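The intent of the proposed fix can be sketched as a classification step during rollback. The record dictionaries and the "salvaged" flag name below are assumptions made for illustration, not the actual journal record format: the only idea taken from the source is that filler NULL/INCTN records carry a marker distinguishing them from user-initiated records.

```python
# Sketch of the proposed fix (field names invented): filler NULL/INCTN
# records are flagged, so rollback routes them to the lost transaction
# file instead of applying them to the database.

def classify(record):
    """Decide where rollback should send a journal record."""
    if record["type"] in ("NULL", "INCTN") and record.get("salvaged"):
        return "lost_transaction_file"   # filler written by rundown, not a user op
    return "database"

user_null = {"type": "NULL", "seqno": 4000}            # legitimate user NULL
filler    = {"type": "NULL", "seqno": 4148, "salvaged": True}
print(classify(user_null))   # database
print(classify(filler))      # lost_transaction_file
```

With this in place, seqno 4148 above would land in the lost transaction file on the secondary instead of being applied as a NULL update, keeping the two instances' data identical.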
Draft Release Note
MUPIP JOURNAL -ROLLBACK -BACKWARD -FETCHRESYNC maintains user data in sync between the primary and secondary sides of a replicated environment when processes are killed with kill -9. Previously, it was possible for it to incorrectly cause the primary and secondary database files to have different data content.