Very rare case of MUPIP REORG TRUNCATE concurrent with internal recovery code no longer causes database damage
Final Release Note
A rare case involving a MUPIP REORG TRUNCATE running concurrently with internal code that is recovering from an unclean process termination (such as a kill -KILL, three consecutive MUPIP STOPs issued to a process, or another abnormal termination), or with a manually run DSE CACHE RECOVER, no longer causes structural database damage. This was only observed in special stress tests in the development environment and was never reported by a user. Note that YottaDB strongly recommends against manually aborting processes, and DSE CACHE RECOVER should only be used by an expert under guidance from your YottaDB support channel. [#755 (closed)]
Description
This issue was noticed in internal testing of the stress/concurr subtest (see stress/u_inref/concurr.csh in the YDBTest repo for test details).
When the test is run with a Debug build of YottaDB, it does a lot of cache recoveries (forced by the test using white-box variables) to really stress the cache recovery logic (wcs_recover()). The test also runs mupip reorg -truncate in the background on both the source side and the receiver side of a replicated environment.
In rare cases (once in 100 test runs), the update process on the receiver side failed with the following assert failure.
%YDB-F-ASSERT, Assert failed in sr_port/gvcst_search.c line 450 for expression (CDB_STAGNATE > t_tries)
The symptom was that the update process read the root block of a GVT (Global Variable Tree) and found that block to have a level of 0, which is an out-of-design state for a root block (a root block must have a level of at least 1). It found this to be the case even in the final retry, when it holds the critical section lock on the entire region.
The issue turned out to be that the mupip reorg -truncate running concurrently in the background had moved the root block, but the update process did not recognize this and therefore did not re-read from the new root block. It kept trying to use the old, stale root block and found that this block had since become a level-0 block.
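The stale-root scenario above can be illustrated with a toy model in C. This is only a sketch under stated assumptions, not YottaDB code: every name in it (toy_db, toy_root_cache, toy_relocate_root and so on) is hypothetical, and the generation counter used to notice the relocation is an assumption chosen for the illustration, not a description of how the actual fix detects a moved root block.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NBLOCKS 8

/* Hypothetical toy block: only the level matters for this sketch. */
typedef struct {
    int level;              /* 0 = data block, >= 1 = index block */
} toy_block;

/* Hypothetical toy database that records where the root block lives. */
typedef struct {
    toy_block blocks[TOY_NBLOCKS];
    size_t    root_blk;     /* authoritative root block number */
    unsigned  root_gen;     /* ASSUMPTION: bumped whenever the root moves */
} toy_db;

/* Root location remembered by a reader from an earlier lookup. */
typedef struct {
    size_t   root_blk;
    unsigned root_gen;
} toy_root_cache;

/* Simulate a truncate-style relocation: copy the root to new_blk and
 * let the old block be reused as a level-0 (data) block. */
void toy_relocate_root(toy_db *db, size_t new_blk)
{
    assert(new_blk < TOY_NBLOCKS);
    size_t old_blk = db->root_blk;
    db->blocks[new_blk] = db->blocks[old_blk];
    db->blocks[old_blk].level = 0;
    db->root_blk = new_blk;
    db->root_gen++;
}

/* Buggy reader: trusts the remembered root block number unconditionally. */
int toy_root_level_stale(const toy_db *db, const toy_root_cache *c)
{
    return db->blocks[c->root_blk].level;
}

/* Checked reader: refreshes the remembered root if it has been moved. */
int toy_root_level_checked(const toy_db *db, toy_root_cache *c)
{
    if (c->root_gen != db->root_gen) {
        c->root_blk = db->root_blk;
        c->root_gen = db->root_gen;
    }
    return db->blocks[c->root_blk].level;
}
```

In this model, a reader that remembered the root at block 6 before toy_relocate_root(&db, 2) sees level 0 through toy_root_level_stale() no matter how often it retries, while toy_root_level_checked() picks up the new root at block 2 and sees its level-1 contents.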
Even though we saw an assert failure in this case, this issue has the potential to cause database damage if the stale root block turns out to be a level-1 or higher block. In that case, we could continue using the wrong root block and update the wrong portions of the database file, resulting in integrity errors.
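To see why a level check alone only catches part of the problem, consider the following toy tree descent in C. It is a sketch under stated assumptions rather than YottaDB code: toy_blk, its owner field and toy_descend are hypothetical names, and each index block is simplified to a single child.

```c
#include <assert.h>

#define TOY_NBLK 8

/* Hypothetical toy block for this sketch. */
typedef struct {
    int level;   /* 0 = data block, >= 1 = index block */
    int child;   /* index blocks only: block number of the single child */
    int owner;   /* id of the toy global that currently owns this block */
} toy_blk;

/* Descend from a presumed root block to a data block.
 * Returns the data block number, or -1 if the starting block cannot be a
 * root (level < 1) or no data block is reached within TOY_NBLK hops. */
int toy_descend(const toy_blk *blks, int start)
{
    assert(start >= 0 && start < TOY_NBLK);
    if (blks[start].level < 1)
        return -1;              /* detectable: a root is never level 0 */
    int cur = start;
    for (int hops = 0; hops < TOY_NBLK && blks[cur].level > 0; hops++)
        cur = blks[cur].child;
    return (blks[cur].level == 0) ? cur : -1;
}
```

If the stale block has been reused as a data block, toy_descend() returns -1 and the problem is visible, which corresponds to the assert above. If it has been reused as an index block of another global, the descent succeeds and ends at a data block whose owner is that other global, so an update applied there would land in the wrong place without any check firing.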
Draft Release Note
mupip reorg -truncate preserves database integrity in all cases. Previously, in very rare cases, it was possible for mupip reorg -truncate to cause database damage when a cache recovery (which can be triggered internally in various scenarios but can also be triggered by the user with dse cache -recover) occurred at the same time. [#755 (closed)]