After disabling Geo, truncate the Geo event log tables
The problem was detected while working on gitlab-com/gl-infra/production#490 (closed).
Problem
When Geo gets disabled (all Geo nodes removed), the Gitlab::Geo::CronManager cron ensures that all Geo-related cron jobs are disabled, including geo_prune_event_log_worker.
But this prune worker was designed to keep running for some time, because the geo_event_log table may still be filled with events that no one will ever handle.
Rejected proposals
The simplest proposal would be to just run the prune worker unconditionally (see gitlab-com/gl-infra/production#490 (comment 109567298)). But I feel that's a waste.
So the improved proposal:
- Geo::CronManager never disables the Geo::PruneEventLogWorker.
- When the Geo::PruneEventLogWorker runs, deletes no rows, and detects that Geo is disabled, it disables itself.
- Geo::CronManager re-enables the Geo::PruneEventLogWorker when Geo gets enabled again, just as it already does for the other Geo jobs.
I don't think that would be hard, implementation-wise, but I'm worried it would confuse people to have jobs enabling/disabling themselves at seemingly random points in time.
Proposal
Some research on staging showed that running TRUNCATE on a very large table (53M records) takes less than a second.
So we can reintroduce the Geo::TruncateEventLogWorker. The worker is scheduled to run 10 minutes after the last node is deleted (e.g. by a Geo::NodeDestroyService).
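A minimal sketch of that flow, in plain Ruby so it runs without Sidekiq or Rails. FakeScheduler, NodeDestroySketch and TruncateEventLogWorkerSketch are hypothetical stand-ins, not the real classes. The re-check inside perform is an assumption on my side: a node might be added back during the 10 minutes, in which case the log should be left alone.

```ruby
# Sketch only: plain Ruby stand-ins, no Sidekiq or Rails required.
# All class names ending in "Sketch" and FakeScheduler are hypothetical.

TRUNCATE_DELAY_SECONDS = 10 * 60 # "10 minutes after the last node is deleted"

# Records scheduled jobs instead of pushing them onto a real queue.
class FakeScheduler
  attr_reader :jobs

  def initialize
    @jobs = []
  end

  def perform_in(delay_seconds, worker_name)
    @jobs << { delay: delay_seconds, worker: worker_name }
  end
end

# Stand-in for the node-destroy service: schedules the truncate only
# when the node being removed was the last one.
class NodeDestroySketch
  def initialize(nodes, scheduler)
    @nodes = nodes
    @scheduler = scheduler
  end

  def execute(node)
    @nodes.delete(node)
    return unless @nodes.empty?

    @scheduler.perform_in(TRUNCATE_DELAY_SECONDS, "Geo::TruncateEventLogWorker")
  end
end

# Stand-in for the worker. Assumption: it checks again at run time that
# no nodes exist, and skips the truncate otherwise.
class TruncateEventLogWorkerSketch
  def initialize(nodes, connection)
    @nodes = nodes
    @connection = connection # any callable that accepts an SQL string
  end

  def perform
    return false unless @nodes.empty?

    @connection.call("TRUNCATE TABLE geo_event_log")
    true
  end
end
```

Removing a secondary while the primary still exists schedules nothing; only removing the last node enqueues the job.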
One thing we need to figure out: can a TRUNCATE cause a large database replication lag?
TRUNCATE is not a cause for concern regarding replication lag. It is generally considered the fastest way to delete data from PostgreSQL. It does, however, require an ACCESS EXCLUSIVE lock on the table (which DELETE doesn't). But the table is completely unused and will stay that way for a considerable amount of time.
I suggest running the query with RESTART IDENTITY so the id will start again at 1 when Geo is enabled again: gitlab-com/gl-infra/production#490 (comment 120080984)
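For reference, a small sketch of what the statement could look like. The truncate_statement helper is hypothetical and only builds a string; nothing is executed. Only geo_event_log is named in this issue, so any further table names would have to be passed in explicitly. Note that PostgreSQL defaults to CONTINUE IDENTITY, so RESTART IDENTITY has to be spelled out, and that TRUNCATE refuses to run on a table referenced by foreign keys unless those tables are listed in the same statement or CASCADE is used.

```ruby
# Sketch only: builds the SQL as a string, nothing is sent to a database.
# truncate_statement is a hypothetical helper, not an existing GitLab method.
def truncate_statement(tables, restart_identity: true)
  raise ArgumentError, "no tables given" if tables.empty?

  # Accept only plain lowercase identifiers, since the names are
  # interpolated directly into the SQL string.
  unless tables.all? { |t| t.match?(/\A[a-z_][a-z0-9_]*\z/) }
    raise ArgumentError, "unexpected table name"
  end

  sql = "TRUNCATE TABLE #{tables.join(', ')}"
  sql += " RESTART IDENTITY" if restart_identity
  sql
end

puts truncate_statement(["geo_event_log"])
# prints: TRUNCATE TABLE geo_event_log RESTART IDENTITY
```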
Edge case
In case only the Geo primary is kept, the Geo::PruneEventLogWorker will keep running. But #min_cursor_last_event_id will return nil, causing the worker to remove nothing. This is fine, because the admin might have accidentally removed some secondary node. Deleting the primary node as well will truncate the whole log, once this issue is implemented.
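The nil behaviour can be illustrated with an in-memory sketch. Both helpers below are hypothetical simplifications: the real worker reads cursors and events from the database, and the "id <= minimum cursor" comparison is an assumption about how prunable events are selected.

```ruby
# Sketch only: in-memory stand-in for the prune decision.
# cursor_last_event_ids holds one "last processed event id" per secondary.

# Smallest event id processed by every secondary, or nil when there are
# no secondaries (the primary-only edge case).
def min_cursor_last_event_id(cursor_last_event_ids)
  cursor_last_event_ids.min # Array#min returns nil for an empty array
end

# Returns the event ids that would be pruned.
def prunable_event_ids(event_ids, cursor_last_event_ids)
  min_id = min_cursor_last_event_id(cursor_last_event_ids)
  return [] if min_id.nil? # no secondaries: remove nothing

  # Assumption: events up to and including the minimum cursor are prunable.
  event_ids.select { |id| id <= min_id }
end
```

With no secondaries the worker keeps every event, so re-adding an accidentally removed secondary does not lose log entries.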