geo-logcursor never exits, for any reason
Problem
Currently, geo-logcursor rescues every StandardError raised while it attempts to take an exclusive lease. As a result, almost any failure occurring inside that block will not cause geo-logcursor to stop.
When geo-logcursor runs in a container, it will essentially never stop, even when serious problems are occurring.
We can't remove this rescue outright, since all kinds of errors can happen when processing a single Geo event, and we don't want logcursor to die every time one does.
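To make the problem concrete, here is a minimal sketch of the pattern. It is not the actual geo-logcursor code: `SketchDaemon`, `with_lease`, and the `max_iterations` cap (added only so the example terminates) are hypothetical names used for illustration.

```ruby
# Hypothetical stand-in for the daemon loop; not GitLab's implementation.
class SketchDaemon
  attr_reader :iterations, :swallowed

  def initialize(max_iterations:)
    @max_iterations = max_iterations # cap exists only so the sketch terminates
    @iterations = 0
    @swallowed = []
  end

  def run!
    until @iterations >= @max_iterations
      @iterations += 1
      run_once!
    end
  end

  private

  def run_once!
    # Simulate a serious, persistent failure inside the lease block.
    with_lease { raise IOError, "database unreachable" }
  rescue StandardError => e
    # Every StandardError lands here, so the loop never stops on its own.
    @swallowed << e.class
  end

  # Placeholder for taking the exclusive lease.
  def with_lease
    yield
  end
end

daemon = SketchDaemon.new(max_iterations: 3)
daemon.run!
```

Every iteration fails in the same way, yet the loop only ends because of the artificial cap. Without it, the process would keep running indefinitely.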
Solution
I am working on https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/15248, which will add periodic health checks (currently every minute) to geo-logcursor. After 5 minutes of consecutive failures, it will exit.
In the first iteration, it will run all liveness health checks listed here: https://docs.gitlab.com/ee/user/admin_area/monitoring/health_check.html#liveness
In a follow-up issue, https://gitlab.com/gitlab-org/gitlab-ee/issues/14629, Geo-specific health checks will be added. Unfortunately, the existing Geo health check code does not "look like" other health checks.
If we need a configurable health check rate or maximum number of consecutive failures, that could be added in a follow-up (but if we can, let's choose good-enough defaults).
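As a rough sketch of the idea (not the code in the merge request), the "exit after 5 minutes of consecutive failures" rule could look like the following. `HealthMonitor`, `tick`, the `checks` lambdas, and the injectable `clock` are all assumptions made for illustration; the real implementation may be structured quite differently.

```ruby
# Hypothetical sketch: tracks how long health checks have been failing.
class HealthMonitor
  def initialize(checks:, max_failure_seconds: 300,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @checks = checks                         # callables returning true when healthy
    @max_failure_seconds = max_failure_seconds
    @clock = clock                           # injectable so the sketch can be tested
    @first_failure_at = nil
  end

  # Runs all checks once. Returns true when the process should exit.
  def tick
    if healthy?
      @first_failure_at = nil                # any success resets the failure window
      return false
    end

    now = @clock.call
    @first_failure_at ||= now
    now - @first_failure_at >= @max_failure_seconds
  end

  private

  def healthy?
    @checks.all?(&:call)
  rescue StandardError
    false                                    # a check that raises counts as a failure
  end
end

# Usage with a fake clock: one tick per "minute", checks always failing.
now = 0
healthy = false
monitor = HealthMonitor.new(checks: [-> { healthy }], clock: -> { now })
results = [0, 60, 120, 180, 240, 300].map do |t|
  now = t
  monitor.tick
end
# results is [false, false, false, false, false, true]
```

In a real daemon, the caller would invoke `tick` on a timer and exit with a non-zero status when it returns true, so that the container runtime can restart the process.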
@WarheadsSE Would this be sufficient to continue with https://gitlab.com/charts/gitlab/issues/1211?
Additional solution
If the above is insufficient, what if we added:
- Exit after a maximum number or duration of consecutive errors => https://gitlab.com/gitlab-org/gitlab-ee/issues/14944
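A count-based variant of this option might be sketched as follows, assuming the existing rescue stays in place and only the number of consecutive errors is tracked. `ConsecutiveErrorCounter` and `process_events` are hypothetical names, not part of geo-logcursor.

```ruby
# Hypothetical sketch: counts consecutive errors while keeping the rescue.
class ConsecutiveErrorCounter
  def initialize(max_errors:)
    @max_errors = max_errors
    @count = 0
  end

  def record_success
    @count = 0
  end

  def record_error
    @count += 1
  end

  def limit_reached?
    @count >= @max_errors
  end
end

# Processes callables one by one; stops once the error limit is reached.
# Returns the number of successfully processed events.
def process_events(events, counter)
  processed = 0
  events.each do |event|
    begin
      event.call
      counter.record_success
      processed += 1
    rescue StandardError
      counter.record_error
      break if counter.limit_reached?
    end
  end
  processed
end

ok = -> { :done }
bad = -> { raise ArgumentError, "bad event" }
counter = ConsecutiveErrorCounter.new(max_errors: 3)
processed = process_events([ok, bad, ok, bad, bad, bad, ok], counter)
```

A single failing event between successes does not trip the limit; only an unbroken run of failures does, at which point the caller can decide to exit.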
cc @ashmckenzie