geo-logcursor never exits, for any reason

Problem

Currently, geo-logcursor rescues every StandardError raised while it attempts to take an exclusive lease. As a result, almost no failure occurring inside that block will cause geo-logcursor to stop.
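To illustrate the behavior, here is a minimal sketch of a blanket rescue around a block. The method name `with_lease_sketch` and the `log` array are hypothetical stand-ins, not the actual geo-logcursor code.

```ruby
# Hypothetical stand-in for the lease block; NOT the real geo-logcursor code.
def with_lease_sketch(log)
  yield
rescue StandardError => e
  # Every StandardError lands here, so the caller's loop simply carries on.
  log << "#{e.class}: #{e.message}"
  nil
end

log = []
3.times do |i|
  # Each iteration fails, yet nothing propagates and the process keeps running.
  with_lease_sketch(log) { raise ArgumentError, "event #{i} failed" }
end
```

Because the rescue sits around the whole block, a persistent failure looks the same to the process as a one-off bad event.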

When geo-logcursor runs in a container, it will basically never stop, even when serious problems are occurring.

We can't remove this rescue outright, since all kinds of errors can happen when processing a single Geo event, and we don't want logcursor to die all the time.

Solution

I am working on https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/15248, which will add periodic health checks (currently every minute) to geo-logcursor. After 5 minutes of consecutive failures, it will exit.
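A rough sketch of the idea, assuming the failure window is measured from the first failed check and reset by any passing check. The class `HealthMonitorSketch` and its parameters are hypothetical and not taken from the merge request.

```ruby
# Hypothetical sketch; names and defaults are assumptions, not the MR's code.
class HealthMonitorSketch
  def initialize(checks:, max_failure_duration: 300,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @checks = checks                            # callables returning true when healthy
    @max_failure_duration = max_failure_duration # seconds of consecutive failure tolerated
    @clock = clock                              # injectable so the logic can be tested
    @first_failure_at = nil
  end

  # Runs all checks once. Returns true when the process should exit.
  def check!
    if healthy?
      @first_failure_at = nil # any success resets the failure window
      return false
    end

    now = @clock.call
    @first_failure_at ||= now
    now - @first_failure_at >= @max_failure_duration
  end

  private

  # A check that raises is treated the same as a check that fails.
  def healthy?
    @checks.all?(&:call)
  rescue StandardError
    false
  end
end

# Possible usage inside the daemon (left commented out so the sketch does not block):
#   monitor = HealthMonitorSketch.new(checks: [-> { true }])
#   loop do
#     exit(1) if monitor.check!
#     sleep 60
#   end
```

Tracking a duration rather than a raw count keeps the exit threshold stable if the check interval changes later.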

In the first iteration, it will run all liveness health checks listed here: https://docs.gitlab.com/ee/user/admin_area/monitoring/health_check.html#liveness

In a follow-up issue, https://gitlab.com/gitlab-org/gitlab-ee/issues/14629, Geo-specific health checks will be added. Unfortunately, the existing Geo health check code does not "look like" other health checks.

If we need a configurable health check rate or maximum number of consecutive failures, that could be added in a follow-up (but if we can, let's choose good-enough defaults).

@WarheadsSE Would this be sufficient to continue with https://gitlab.com/charts/gitlab/issues/1211?

Additional solution

If the above is insufficient, what if we added:

Exit after a maximum number or duration of consecutive errors => https://gitlab.com/gitlab-org/gitlab-ee/issues/14944

cc @ashmckenzie

Edited Sep 09, 2019 by Michael Kozono