Source Server continues when instance is frozen because of an error switching journal files

Final Release Note

Source Server processes continue operation even when the instance is frozen due to an error while switching journal files. In prior versions of YottaDB (and GT.M versions V6.3-001A and above), the Source Server could terminate with a JNLSWITCHRETRY error in this uncommon situation. This issue was only observed in the development environment and was never reported by a user. (#235)

Description

The v62000_1/gtm8086 subtest failed because the source server terminated prematurely with a JNLSWITCHRETRY error.

The test runs many processes that update the database and journal files with a small autoswitchlimit value, so journal file switches happen frequently (once or twice a second).

After one journal switch has happened, the test turns off write permissions to the directory housing the journal files ("jnldir").

Because this test runs with replication and has instance freeze turned on, it expects the instance to freeze when a journal switch happens after write permissions have been turned off. The instance did freeze in this test run.

Soon after, the test turns write permissions back on for "jnldir".

In this test run, though, the source server had already sent all seqnos up to the penultimate generation journal file, so it tried to open the latest generation journal file (say JNL1), which was full of data (up to the autoswitch limit).

Before the source server could open JNL1, however, another mumps process had concurrently closed it in preparation for switching to a new journal file and had set the "is_not_latest_jnl" field in the journal file header of JNL1. That process then encountered an error while trying to create the new generation journal file JNL2 (due to write permissions), so it did not rename the JNL1 file (.mjl --> .mjl_..).

That meant the source server saw JNL1 and wanted to open it ("jnl_file_open_common") but saw the "is_not_latest_jnl" field and issued a JNLSWITCHRETRY error in the source server log and terminated. That in turn caused the test failure.

This is an edge-case code issue in the source server that has existed since GT.M V6.3-001A, when the JNLSWITCHRETRY error was introduced.

In the case of mumps processes, once the "jnl_file_open_common" call fails with a JNLSWITCHRETRY error, they try the "jnl_file_open_switch" call. Because the instance is frozen, they hang there (trying to create a new journal file goes through LSEEKWRITE, which hangs).

But in the case of the source server, we have code in jnl_file_open.c which disables the call to "jnl_file_open_switch" because we decided the source server is a read-only process and should never create journal files.

The source server therefore takes a different codepath, which prints the JNLSWITCHRETRY error.

The source server should not issue the error. Instead, it should read from this journal file just as it would read from any prior generation journal file, since the file will never again be the currently open journal file in shared memory.
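The difference between the codepaths described above can be sketched as a small decision function. This is only an illustration of the reasoning, not the actual YottaDB code: the struct, the enum, the function names, and the "with_fix" parameter are all hypothetical, and only the "is_not_latest_jnl" field name comes from the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the single journal file header field discussed above. */
struct jnl_hdr_sketch {
    bool is_not_latest_jnl; /* set when a process closes the file in preparation for a switch */
};

/* Hypothetical labels for what a process does after looking at the header. */
enum jnl_open_action {
    JNL_OPEN_AS_CURRENT,       /* file is still the latest generation: use it normally */
    JNL_OPEN_SWITCH,           /* mumps process: try to create the next generation (hangs while frozen) */
    JNL_OPEN_ERROR_RETRY,      /* source server before the fix: report JNLSWITCHRETRY and terminate */
    JNL_OPEN_READ_AS_PRIOR_GEN /* source server after the fix: read the file like an older generation */
};

/* Sketch of the decision: is_source_server and with_fix are illustrative parameters. */
enum jnl_open_action jnl_open_action_sketch(const struct jnl_hdr_sketch *hdr,
                                            bool is_source_server, bool with_fix)
{
    if (!hdr->is_not_latest_jnl)
        return JNL_OPEN_AS_CURRENT;
    if (!is_source_server)
        return JNL_OPEN_SWITCH; /* updating processes are allowed to create journal files */
    /* The source server is read-only and never creates journal files. */
    return with_fix ? JNL_OPEN_READ_AS_PRIOR_GEN : JNL_OPEN_ERROR_RETRY;
}

/* Walks through the scenario from this issue: JNL1 was closed but never renamed. */
void jnl_open_sketch_selfcheck(void)
{
    struct jnl_hdr_sketch jnl1 = { true };

    assert(jnl_open_action_sketch(&jnl1, true, false) == JNL_OPEN_ERROR_RETRY);
    assert(jnl_open_action_sketch(&jnl1, true, true) == JNL_OPEN_READ_AS_PRIOR_GEN);
    assert(jnl_open_action_sketch(&jnl1, false, true) == JNL_OPEN_SWITCH);
}
```

In this sketch only the source server branch changes with the fix; mumps processes still attempt the switch and wait while the instance is frozen.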

Draft Release Note

The replication source server works correctly even if the replication instance is frozen due to an error while switching journal files. In prior versions of YottaDB (and GT.M versions V6.3-001A and above), the source server could terminate with a JNLSWITCHRETRY error in this scenario in rare cases.