After disabling Geo, truncate the Geo event log tables
The problem was detected while working on gitlab-com/gl-infra/production#490 (closed).
When Geo gets disabled (all Geo nodes removed), the `Gitlab::Geo::CronManager` cron will ensure all Geo-related cron jobs are disabled, including the `Geo::PruneEventLogWorker`. But this prune worker was designed to keep running for some time, because the `geo_event_log` table might still be filled with events that no one will ever handle.
The simplest proposal would be to just run the prune worker unconditionally (see gitlab-com/gl-infra/production#490 (comment 109567298)). But I feel that's wasteful.
So the improved proposal:
- The `Geo::CronManager` never disables the `PruneEventWorker`.
- When the `PruneEventWorker` runs, didn't delete any rows, and detects that Geo is disabled, it disables itself.
- The `Geo::CronManager` re-enables the `PruneEventWorker` when Geo gets enabled again, just as it already does for the other Geo jobs.
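The self-disabling step could be sketched roughly as below. This is a minimal illustration, not GitLab's actual worker: the hash-based `cron_entry` stands in for whatever Sidekiq-cron object the real implementation would toggle, and `geo_enabled` for the real Geo configuration check.

```ruby
# Sketch of a prune worker that disables its own cron entry once there is
# nothing left to do and Geo is gone. All names here are illustrative.
class PruneEventWorkerSketch
  def initialize(cron_entry:, geo_enabled:)
    @cron_entry = cron_entry   # e.g. { enabled: true }, a stand-in for the cron job
    @geo_enabled = geo_enabled
  end

  # `deleted_rows` would come from the actual prune query.
  def perform(deleted_rows)
    # Disable the cron entry only when there was nothing left to prune
    # AND Geo is no longer configured.
    @cron_entry[:enabled] = false if deleted_rows.zero? && !@geo_enabled
  end
end
```

The `CronManager` would then flip `enabled` back to `true` on re-activation, which is the part it already handles for the other Geo jobs.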
I don't think that would be hard, implementation-wise, but I'm worried it would confuse people when jobs enable and disable themselves at seemingly random points in time.
Some research on staging showed that running `TRUNCATE` on a very large table (53M records) takes less than a second.
So we can reintroduce the `Geo::TruncateEventLogWorker`. The worker is scheduled to run 10 minutes after the last node is deleted (e.g. by a …
One thing we need to figure out: can a `TRUNCATE` on a large table cause database replication lag?
`TRUNCATE` is not a cause for concern regarding replication lag. It is generally considered the fastest way to delete data off of Postgres. It does, however, require an exclusive lock on the table (which blocks all other access to it). But the table is completely unused and will stay that way for a considerable amount of time.
I suggest running the query with `RESTART IDENTITY` so the `id` will start again at `1` when Geo is enabled again: gitlab-com/gl-infra/production#490 (comment 120080984)
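For illustration, the statement such a worker could run might look like this. Only `geo_event_log` is named in this issue; whether other event tables need the same treatment is an open question, and the commented-out ActiveRecord call is just the generic way to execute raw SQL in Rails:

```ruby
# TRUNCATE with RESTART IDENTITY resets the id sequence, so ids begin at 1
# again once Geo is re-enabled. Table choice here is an assumption based on
# this issue; the real worker may need to cover more tables.
sql = "TRUNCATE geo_event_log RESTART IDENTITY"

# In a Rails context this would be executed against the database:
# ActiveRecord::Base.connection.execute(sql)
```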
In case only the Geo primary is kept, the `Geo::PruneEventLogWorker` will keep running. But `#min_cursor_last_event_id` will return `nil`, causing the worker to remove nothing. This is fine, because the admin might have accidentally removed a secondary node. Deleting the primary node as well will truncate the whole log, once this issue is implemented.
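That `nil` guard behaviour can be sketched as follows. The method and argument names are assumptions for illustration; the real worker operates on database rows rather than an id array:

```ruby
# Illustrative prune guard: with no secondaries there is no "lowest
# acknowledged event" cursor, so min_cursor_last_event_id is nil and the
# whole log is kept.
def prune_events(event_ids, min_cursor_last_event_id)
  # Primary-only setup: no cursor, so remove nothing.
  return event_ids if min_cursor_last_event_id.nil?

  # Otherwise drop every event that all secondaries have already processed.
  event_ids.reject { |id| id <= min_cursor_last_event_id }
end
```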