MUPIP FREEZE ONLINE ensures database and journal file integrity in a rare case

Final Release Note

MUPIP FREEZE ONLINE ensures database and journal file integrity. Previously, in a rare case when the MUPIP FREEZE ONLINE process timed out waiting to get access, it was possible for the process to update database and/or journal files without an exclusive lock, which resulted in database/journal file structural damage. This was only encountered in a development environment and not reported by a user. [#738 (closed)]

Description

In-house testing of v63003/gtm8850 showed a failure in one of the ARMV7L systems with the following failure diff.

5a6,7
> v63003_0/gtm8850/x6.out
> %YDB-E-JNLSWITCHFAIL, Failed to switch journal file v63003_0/gtm8850/mumps.mjl for database file

This happened only once in many hundreds of test runs. Analysis showed it to be an issue with the online freeze functionality that can, in very rare cases, result in database and/or journal file corruption.

An online freeze process (the one running the mupip freeze -on -online command) is treated like dse or lke in that it can bypass the database startup lock code if it cannot get the lock in 3 seconds (e.g. in case of a loaded system).
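The timeout-then-bypass behavior described above can be sketched roughly as follows. This is only an illustrative model, not the YottaDB implementation: the function name `enter_startup_section`, the `may_bypass` flag, and the use of a Python `threading.Lock` in place of the database startup lock are all assumptions made for the sketch. Only the 3 second timeout comes from the description.

```python
import threading

# Timeout taken from the description above; everything else is hypothetical.
STARTUP_LOCK_TIMEOUT_SECONDS = 3


def enter_startup_section(startup_lock, may_bypass,
                          timeout=STARTUP_LOCK_TIMEOUT_SECONDS):
    """Try to obtain the startup lock within `timeout` seconds.

    Returns "locked" if the lock was obtained, "bypassed" if the caller is
    allowed to proceed without it (the dse/lke/online-freeze style of caller
    in this model), and raises TimeoutError for every other caller.
    """
    if startup_lock.acquire(timeout=timeout):
        return "locked"
    if may_bypass:
        # The caller continues WITHOUT the lock: this is the window in which
        # a concurrent rundown cannot see the caller attaching.
        return "bypassed"
    raise TimeoutError("could not obtain the startup lock")


# Usage: another party (standing in for P2) already holds the lock.
startup_lock = threading.Lock()
startup_lock.acquire()
outcome = enter_startup_section(startup_lock, may_bypass=True, timeout=0.05)
print(outcome)
```

In this model, the "bypassed" outcome is the case that matters: the caller proceeds to attach while the lock holder is free to tear down what the caller is about to use.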

This means it is possible in very rare cases (with the right timing of events across multiple processes) for the online freeze process (say P1) to attach to database shared memory at the exact same time that another process (say P2), running database rundown while holding the database startup/shutdown lock, is removing that very same shared memory from the system and recording that fact in the database file header. P2 may not see that P1 is in the process of attaching to the database, because P1 bypassed the database startup lock code. Had there been no bypass, this simultaneous attach/remove would not be possible.

As a result, if a new process (say P3) starts accessing the database, it creates a new shared memory segment for that database file while P1 is still using the deleted shared memory id to connect to the same database.

In effect, we would have two processes P1 and P3 accessing the same database file using two different shared memory segments.

Each would hold the database critical section lock, incorrectly believing it has an exclusive lock on the database, because the critical section lock resides in shared memory, and each process is using a different shared memory segment.

Since online freeze flushes dirty buffers and switches journal files, P1 can now be writing to the database and/or journal file while P3 is doing exactly that (e.g. due to updates in the M program it runs).

Having multiple processes update the database file or journal file without a common lock is a recipe for mysterious errors, including database/journal corruption.
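The P1/P2/P3 sequence can be replayed in a small toy model to show why both P1 and P3 believe they hold an exclusive lock. This is a sketch under stated assumptions: the `Segment` and `DatabaseFile` classes, the `segments` table, and the shmid values 1 and 2 are hypothetical stand-ins, and the steps run sequentially in one Python process rather than as real processes using real shared memory.

```python
import threading


class Segment:
    """Hypothetical stand-in for one shared memory segment.
    The critical section lock lives inside the segment."""
    def __init__(self, shmid):
        self.shmid = shmid
        self.crit = threading.Lock()


class DatabaseFile:
    """Hypothetical stand-in for the database file header,
    which records the shmid currently backing the database."""
    def __init__(self):
        self.header_shmid = None


segments = {}          # stand-in for the system's table of segments
db = DatabaseFile()


def create_segment(shmid):
    segments[shmid] = Segment(shmid)
    db.header_shmid = shmid
    return segments[shmid]


# Initial state: segment 1 backs the database.
create_segment(1)

# P1 (online freeze) has bypassed the startup lock and picks up segment 1.
p1_seg = segments[db.header_shmid]

# P2 (rundown, holding the startup lock) removes segment 1 and records
# that in the file header. It never noticed P1 attaching.
del segments[1]
db.header_shmid = None

# P3 starts, sees no segment recorded in the header, and creates a new one.
p3_seg = create_segment(2)

# Each process grabs the critical section lock of ITS OWN segment.
p1_in_crit = p1_seg.crit.acquire(blocking=False)
p3_in_crit = p3_seg.crit.acquire(blocking=False)

print(p1_seg is p3_seg)            # False: two different segments
print(p1_in_crit and p3_in_crit)   # True: both "hold" the lock at once
```

Because the lock is per segment, neither acquisition blocks the other, which is the loss of mutual exclusion described above.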

Across a total of 69 failures (out of 72,000 in-house test runs), we saw the following symptoms using a mix of Release/PRO and Debug/DBG builds.

39 - %YDB-E-REQRECOV
10 - %YDB-E-JNLSWITCHFAIL
 8 - Assert failed CRE_JNL_FILE.C line 483 for expression (EACCES == info->status)
 5 - %YDB-E-JNLFILOPN
 4 - WAITPROCALIVE-E-WAITTOOLONG because mumps processes are waiting due to a frozen database
 1 - %YDB-E-JNLOPNERR
 1 - Assert failed T_END.C line 701 for expression (!FROZEN_HARD(csa) || IS_DSE_IMAGE)
 1 - Assert failed JNL_FILE_OPEN.C line 223 for expression (ydb_white_box_test_case_enabled && ((WBTEST_JNLOPNERR_EXPECTED == ydb_white_box_test_case_number) || (WBTEST_JNL_CREATE_FAIL == ydb_white_box_test_case_number)))
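As a quick cross-check of the tally above, the per-symptom counts can be summed and compared against the number of test runs. The counts and the 72,000 run total are taken from the list above. The shortened dictionary labels are only for this sketch.

```python
# Symptom counts copied from the list above; labels abbreviated for the sketch.
symptoms = {
    "REQRECOV": 39,
    "JNLSWITCHFAIL": 10,
    "Assert CRE_JNL_FILE.C line 483": 8,
    "JNLFILOPN": 5,
    "WAITTOOLONG": 4,
    "JNLOPNERR": 1,
    "Assert T_END.C line 701": 1,
    "Assert JNL_FILE_OPEN.C line 223": 1,
}

total_runs = 72_000
total_failures = sum(symptoms.values())
failure_rate = total_failures / total_runs

print(total_failures)                        # 69
print(f"{failure_rate:.4%}")                 # 0.0958%
print(round(total_runs / total_failures))    # 1043, i.e. about 1 failure in 1043 runs
```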

This issue is more likely to happen when the processes that attach to the database are short-lived and start and exit frequently. It is therefore unlikely to be an issue in replicated regions: the source server has the database file open and usually runs in the background for the entire time the database is open, so an exiting process (like P2 in the above description) is unlikely to conclude that it is the last process to detach and therefore has to remove the database shared memory segment. This is not guaranteed, however, since the source server can go down on its own due to an error, or be shut down explicitly by operator action, while other processes are still accessing the replicated database region. In that case, this issue can occur even in replicated regions.

Draft Release Note

Online freeze preserves database and journal file integrity in all cases. Previously, in rare cases when the online freeze process timed out trying to get hold of the database startup access control lock, it was possible for the online freeze to update the database and/or journal files without effectively holding an exclusive lock, which resulted in database/journal file corruption.

Edited Jul 06, 2021 by K.S. Bhaskar