Allow user control over when process-ending panic is thrown in error situations
Final Release Note
User documentation updated with YottaDB/DB/YDBDoc!530 (merged)
Description
A user has reported an issue when running Go/YottaDB applications in Docker containers and/or Kubernetes pods. When a Kubernetes pod is shut down, it sends a SIGTERM to the applications running on it. This user's code was notified when the SIGTERM happened, as was YottaDB (the YDBGo wrapper sets up a notification goroutine for SIGTERM and other signals). This resulted in a race condition between the YDBGo wrapper and the cleanup handler of the application. If the YDBGo wrapper finishes first, it throws a panic() (whose purpose is to unwind the process and drive the yottadb.Exit() handler, which was presumably coded with a defer statement early in the main), whereas if the application rundown finishes first, the application exits, cutting short the YottaDB cleanup and perhaps leaving databases un-flushed.
To avoid this race condition, some sort of cooperation between user code and the wrapper would seem to be needed. The main question right now is what form that support should take. One suggestion is to have the wrapper sleep for a configurable 5-10 seconds in an error situation (the wait is bypassed for orderly process exits) before driving the panic call. But this neither guarantees that the application has completed its rundown nor helps in the other direction, where the application finishes first and its exit kills off the wrapper's cleanup.
Another possibility that's not much more complicated is to introduce a generic method for initializing access to YottaDB with a specific call like yottadb.Init(), perhaps taking a list of options that configure the DB connection. In the simplest case, we could have an option to NOT set a handler for SIGTERM (we might add SIGQUIT to the list too), with the caveat that the application would be responsible for driving yottadb.Exit() before terminating.
The option of not processing SIGTERM leaves a burden on the application, though, and if there's a path where the application could exit without driving yottadb.Exit(), that would be a difficult-to-find bug. So a further option may be that, instead of an option to not handle SIGTERM, we give an option where the user can pass the address of a handler they would like driven if that signal occurs. The wrapper would drive the supplied handler and, when it returned, would drive the YDB handler. This eliminates the race condition by making the two cleanups run sequentially instead of in parallel.
We have yet to choose between these ways forward (which is a non-exhaustive list) so some discussion is warranted.
[Update: 2021/11/02]: We have added yottadb.RegisterSignalNotify() and yottadb.UnRegisterSignalNotify(). The actual added API is described in the next section.
Draft Release Note
The following functions are added to the YottaDB Go Wrapper:
yottadb.RegisterSignalNotify(sig syscall.Signal, notifyChannel, ackChannel chan bool, whenToNotify yottadb.YDBHandlerFlag) error
yottadb.UnRegisterSignalNotify(sig syscall.Signal)
yottadb.Init()
The function descriptions are as follows:
- yottadb.RegisterSignalNotify() - Used to register an interest in being notified on a specified channel when a signal that the Go wrapper already handles (list shown below) occurs.
- yottadb.UnRegisterSignalNotify() - Used to discontinue being notified when a given signal occurs.
- yottadb.Init() - While all SimpleAPI routines and the above RegisterSignalNotify()/UnRegisterSignalNotify() routines automatically initialize the YottaDB engine, there may be occasions when we want to initialize the engine prior to the first SimpleAPI call. This function does that initialization.
The parameters are:
- sig - the signal the call pertains to (e.g. syscall.SIGINT). The supported list of signals is as follows:
  - syscall.SIGABRT (same as syscall.SIGIOT)
  - syscall.SIGALRM
  - syscall.SIGBUS
  - syscall.SIGCONT
  - syscall.SIGFPE
  - syscall.SIGHUP
  - syscall.SIGILL
  - syscall.SIGINT
  - syscall.SIGQUIT
  - syscall.SIGSEGV
  - syscall.SIGTERM
  - syscall.SIGTRAP
  - syscall.SIGURG (same as syscall.SIGPOLL and syscall.SIGIO - seems to happen several times a second)
  - syscall.SIGUSR1
- notifyChannel - The channel the user application is notified on when the desired signal occurs.
- ackChannel - The channel the user application uses to let the signal handler know it has completed what it needed to do. The notify routine waits for the acknowledgement for a period of time (currently 15 seconds) after which the YDBGo wrapper goroutine that received the notification will proceed and not wait further for the acknowledgement. Note this channel is flushed (read until empty) before a notification is made.
- whenToNotify - Indicates when in the signal handling process the notification is made (type YDBHandlerFlag). The choices are:
  - yottadb.NotifyBeforeYDBSigHandler - notifies channel BEFORE the YDB signal handler is run.
  - yottadb.NotifyAfterYDBSigHandler - notifies channel AFTER the YDB signal handler is run (usually kills process for fatal signals).
  - yottadb.NotifyAsyncYDBSigHandler - notifies channel same time as YDB signal handler runs (no wait for ACK in this mode).
  - yottadb.NotifyInsteadOfYDBSigHandler - notifies channel and NEVER runs the YDB signal handler.
There are currently no errors being returned from RegisterSignalNotify() but the error is added to the interface now for use in future versions. If an unsupported signal is passed, the present action is to throw a panic() but future versions will return an error in that event when YottaDB/DB/YDB#790 is completed.
This support also "sort of" removes two signals from being handled. The signals SIGIO and SIGIOT were removed but because their numeric values are identical to two other signals (SIGURG and SIGABRT respectively) that we already handle, there is no loss of functionality. Should SIGIO or SIGIOT be used in this new RegisterSignalNotify() interface, they will work but will be handled as SIGURG or SIGABRT respectively. This was also the source of the occasional previous doubling of messages (seen only internally) since there were two handlers being driven for each of these two signals so the handler ran twice.
This support also fixed some occurrences of Go DATA RACE issues when shutting down. The failures were only seen with the new test written for this support (go/ydbgo34) but since the data race was with the wgexit Mutex in misc.go, it is conceivable a test could be written and similarly fail using a version without the added support of this MR.
Note this update also adds additional Go version testing to the pipeline (was not testing 1.16 or 1.17 at all).
Note support for SIGHUP was added as it is handled as of r1.34 containing GT.M V6.3-011.
Some Usage Notes:
- Signal handling is much improved with this MR but issues still exist (details in YottaDB/DB/YDB#790 description). This MR is the first piece of the signal handling improvements planned with two more pieces planned for future YottaDB and Go wrapper versions. Consequently, one needs to be very careful when using signals with the YDBGo wrapper. The areas to be careful with are noted as further elements in this list.
- Always put a 'defer yottadb.Exit()' in your main so YottaDB is cleaned up as your application exits. This takes care of the "normal exit" case where goroutines terminate and control bubbles back up through the main which eventually terminates when the main routine exits.
- It is best to avoid application exits that use 'out-of-band' means to exit (e.g. os.Exit()). The reason is this bypasses orderly rundown of all goroutines, which we have found is key to getting everything to shut down correctly.
- The main program, or its agents, need to be aware of all the goroutines that have been launched and have made calls to YDB, and make sure all of those goroutines have completed and shut down before the main program is allowed to exit. Failure to do this is likely to cause YottaDB rundown procedures to be cut short or bypassed altogether. This is what causes database damage.
- We strongly suggest NOT using signal.Notify() for any signal that YottaDB is using in the Go wrapper (full list in a table at the top of init.go - currently 17 signals) but instead using the yottadb.RegisterSignalNotify() function for notification of these signals. When a user uses signal.Notify(), this creates a race condition between the application's direct notify and the YottaDB signal handler since both are notified at the same time. In the YottaDB handler for fatal signals, the process will shut down the YottaDB engine, perhaps while the user's handler is trying to do something else (cleanups and/or SimpleAPI calls). The subset of these signals that applications can use the wrapper's notification for is listed at the top of this entry.
- Note that SIGUSR2 is USED internally by the alternate signal support in YottaDB so it should NOT be otherwise used.
- Everything said above applies ONLY to ASYNCHRONOUS signals as defined by Go - these are signals that are sent to the process by kill(self,xx) type calls. It does NOT apply to SYNCHRONOUS signals, which are raised by the hardware itself or by the OS and apply directly to what was being executed - e.g. a SIGSEGV that occurs as a result of dereferencing a NULL or otherwise invalid address. The Go documentation mentions SIGBUS, SIGFPE, and SIGSEGV as being synchronous signals but I'll be surprised if SIGILL is not also one when invalid code is executed as the process cannot asynchronously continue. Note that if you send a process a SIGSEGV, then it is considered an asynchronous signal and is handled accordingly.
- At this time, YottaDB does not do anything to handle synchronous signals, which Go automatically turns into a panic. That will likely change after YDB#790 is complete.
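To illustrate the last two notes: in plain Go, a fault raised by the hardware, such as a nil pointer dereference, surfaces as a run-time panic rather than as a signal notification. The stdlib-only sketch below shows that behavior; derefNil is a hypothetical helper and no YottaDB calls are involved.

```go
package main

import "fmt"

// derefNil deliberately writes through a nil pointer. The fault is a
// synchronous signal, so the Go runtime converts it into a panic, which the
// deferred recover() turns into an ordinary error value.
func derefNil() (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	var p *int
	*p = 1 // faults here; control jumps to the deferred function
	return nil
}

func main() {
	fmt.Println(derefNil())
}
```

A SIGSEGV sent from outside the process (e.g. with kill) would not take this path; as noted above, it is treated as an asynchronous signal.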
The test program for this support, which can serve as an example, can be found in YottaDB/DB/YDBTest/go/inref/ydbgo34a.go. This program uses the above mentioned support to wait for a randomly selected signal sent to the process and deal effectively with it. Prior to the signal coming in, 4 goroutines are started up that are doing IncrE() calls on a database node in TP when the signal hits. The possible signals and how they are dealt with are denoted in a table at the top of the routine. This test currently avoids letting the YDB signal handler run for any fatal signals as that mode of operation runs into the issues described in YottaDB/DB/YDB#790. But note how the main starts up each of the worker goroutines with a waitgroup so it knows to wait for that goroutine to finish. Also, there is a flag that the worker goroutines check on each iteration so it is easy to signal a shutdown to all goroutines. Once they all shut down, the program terminates. There is no early termination of any goroutines or threads; they shut down in an orderly fashion.
Note this commit also addresses #37 (closed) - which was actually fixed in development before the issue was created as it was also seen in an earlier version of the ydbgo34a.go portion of the test. The fix here involved the 'lclSignalActive' flag used in the shutdownSignalGoroutines() function in init.go. The new support does not require that a signal that is "active", meaning it has occurred and is waiting for handling to be completed, be complete before allowing shutdown to proceed. This is because if we are already shutting down, the signal is moot anyway.
There are 3 timers used by the wrapper when shutting down the YottaDB engine. The durations of the timers are held in global values that are user accessible for update (defined in yottadb.go). The three timer descriptions are as follows:
- The yottadb.Exit() timer. The yottadb.Exit() routine runs C.ydb_exit() but in the event that C.ydb_exit() cannot get the engine lock because it is already held, the timer prevents a permanent deadlock. If it does time out, a DBRNDWNBYPASS warning message is sent to syslog. There are two default timer values associated with this timer. The yottadb.MaximumPanicExitWait value is used if there has been a fatal (asynchronous) signal that drove a panic. Otherwise, the timeout used is yottadb.MaximumNormalExitWait.
- The timer used in shutdownSignalGoroutines (in init.go), which stops the goroutine for each of the signals being handled. It is possible that a goroutine is busy or otherwise hung, so the shutdown may not complete, although that seems to be fairly rare with the changes in this version. The value that controls this wait is yottadb.MaximumSigShutDownWait.
- When user signal notification is being used and the wrapper notifies a goroutine by posting to a signal's notification channel, it is expected that when the goroutine is done with the signal, it will post on the acknowledgement channel. If that acknowledgement is delayed longer than yottadb.MaximumSigAckWait seconds, the wrapper gives up waiting and either continues handling the signal or goes back to waiting for another occurrence of the signal. If the timer expires, a SIGACKTIMEOUT warning is sent to syslog.
For each of the yottadb.Maximum*Wait values there is a corresponding yottadb.DefaultMaximum*Wait constant that holds the initial default for that value.