The idea and motivation are the same as #424267 (closed); however, this issue seems to bring down secondary sites.
Ideally, the change from int to bigint will avoid the problem for most instances, but for those that run into this issue we should provide some way to work around it (maybe by truncating the tables?).
Marking this as a bug since a customer has now hit this and we need to resolve it ASAP.
Since we know what needs to be done, I've applied the workflow::ready for development and geo::active labels.
Ideally we want to truncate up to the point where we safely don't need those events; that could be as late as the last backup or the last time the secondary was fully synced. But do we need that for a subsequent successful sync?
@sranasinghe Yes, I think truncating the table will allow the customer to bring the Geo secondaries back online, since we are restarting the auto-increment counters. The other option is to manually migrate the column to bigint, if the customer is okay with some downtime/long locks, which would fix the issue quickly.
@dbalexandre (the customer here) - I did actually force the geo_event_log.geo_event_id column to bigint (by adding a new column, copying the values over from the old column, and renaming the columns inside a transaction), but it didn't resolve the problem. That might be because a trigger or foreign key was still referencing the old column; I'm not sure.
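For context, a rough SQL sketch of that kind of in-place column swap is below. This is untested, the intermediate column names are my own, and as noted above it did not resolve the problem in this case:

```sql
-- Rough sketch (untested) of the manual column swap described above.
-- As noted, it did not resolve the problem here, possibly because a
-- trigger or foreign key still referenced the old column.
BEGIN;
ALTER TABLE geo_event_log ADD COLUMN geo_event_id_new bigint;
UPDATE geo_event_log SET geo_event_id_new = geo_event_id;
ALTER TABLE geo_event_log RENAME COLUMN geo_event_id TO geo_event_id_old;
ALTER TABLE geo_event_log RENAME COLUMN geo_event_id_new TO geo_event_id;
COMMIT;
```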
I'm in favor of truncating the geo_events and geo_event_log tables, if that works. Since we've removed all Geo config entirely, starting from scratch has the benefit of pruning all the unnecessary history from these tables.
There was more discussion in Slack. I'm attempting to write down a workaround until a migration for this is released. I have not tested these steps, so attempt them on a staging environment first, and let me know if I made a mistake or how this can be improved.
If you are encountering this issue and your primary is down
To initially get back up:
Route all user requests to the primary site.
Disable Geo by deleting the geo_nodes records (there's no need to remove PG replication slots or edit configuration, since we will be re-enabling Geo soon):
On the primary site, in a Rails console or Rails runner, run GeoNode.delete_all
On the primary site, restart Puma and Sidekiq on all nodes where they are running: gitlab-ctl restart puma; gitlab-ctl restart sidekiq
To resolve the ID limit, and get the secondary site back up
On the primary site, SSH into the PG node (the leader node if running a cluster). Truncate the Geo events tables and change the affected column to bigint so that it can handle very high IDs (see the SQL sketch after these steps). Note that after commit, the statement might appear to hang for quite some time (no prompt returned) until the data is freed. This doesn't mean that the table will be locked.
Restart GitLab on primary and secondary sites. The main services we want to restart are Puma/Sidekiq/Geo Log Cursor, though it's not a bad idea to restart everything.
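A minimal SQL sketch of the truncate-and-alter step, untested and using the table names from this discussion (run against the primary site's database, for example via sudo gitlab-psql):

```sql
-- Sketch only: empty both Geo event tables in a single statement
-- (geo_event_log references geo_events, so truncating them together avoids
-- needing CASCADE; if Postgres complains about other referencing tables,
-- include those too) and restart their sequences.
TRUNCATE geo_event_log, geo_events RESTART IDENTITY;

-- With the table now empty, widening the column is quick.
ALTER TABLE geo_event_log ALTER COLUMN geo_event_id TYPE bigint;
```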
If you are not yet affected by this issue, but your geo_event_log.geo_event_id is nearing the limit
If your geo_event_log table happens to be small (we don't have a precise size at this time), then you can mitigate this issue in advance by running ALTER TABLE geo_event_log ALTER COLUMN geo_event_id TYPE bigint; on the primary site's Postgres database. This statement locks the table but it may be fast enough if the table is small.
If the tables are large, then this command may lock the table for too long, and the statement may time out. In that case, you may need to perform the full workaround above, which disables Geo and resets Geo replication. Truncating the tables in order to alter the column discards Geo events, which may lead to data loss and orphaned data on secondaries.
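If you do attempt the preemptive ALTER, one hedged option (an assumption on my part, not something tested as part of this issue) is to cap how long the statement will wait for the lock, so it fails fast instead of queueing up behind other traffic:

```sql
-- Sketch, not validated. lock_timeout only bounds how long ALTER waits to
-- acquire the lock; once acquired, the table rewrite itself still holds the
-- lock for its full duration.
BEGIN;
SET LOCAL lock_timeout = '10s';
ALTER TABLE geo_event_log ALTER COLUMN geo_event_id TYPE bigint;
COMMIT;
```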
How to determine if your geo_event_log.geo_event_id is nearing the limit
Visit Admin Area > Geo > Sites
In the primary site, look at Last event ID
Compare that number with the limit of a Postgres int: 2,147,483,647
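Alternatively, you can check directly against the primary site's database. A small sketch, assuming the column names discussed above:

```sql
-- How close geo_event_log.geo_event_id is to the signed 32-bit int limit.
-- Returns NULLs if the table is currently empty.
SELECT MAX(geo_event_id)                                AS current_max_id,
       2147483647 - MAX(geo_event_id)                   AS ids_remaining,
       ROUND(100.0 * MAX(geo_event_id) / 2147483647, 2) AS percent_of_limit_used
FROM geo_event_log;
```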
Now that we have a workaround, I am lowering the severity and priority of this issue to severity::2 / priority::2. Thank you to everyone who helped identify a workaround.
This group::geo bug has at most 50% of the SLO duration remaining and is an SLO::Near Miss breach. Please consider taking action before this becomes SLO::Missed in 29 days (2024-02-03).
@aakriti.gupta with Douglas's cleanup MR above, are we able to close this bug, or is there more cleanup to do here? Can you also help me confirm the weight I've put on this bug?