Incident Review: Runner verification API returning 500

Key Information

Customers Affected: Roughly 550 top-level namespaces were impacted (~12% of total namespaces); 3,807 unique group runners were affected (0.37% of all runners). (Logs have aged out at the time of writing.)

Requests Affected: Roughly 5 million requests to /api/:version/runners/verify returned 500 errors; some of these are expected to have been retries. (Logs have aged out at the time of writing.)

Incident Severity: severity2

Start Time: 04:18 UTC

End Time: 22:05 UTC

Total Duration: 17 hr 47 min

Link to Incident Issue: #18792 (closed)

Summary

Customer Impact

On Thursday, 31 October, roughly 550 top-level namespaces utilising group runners experienced server errors when trying to create runner managers with an unknown system ID (system IDs can be randomly generated, notably in Docker or Kubernetes environments), or when starting runners that had been inactive during the 7 days prior to this incident. Affected namespaces were unable to run jobs using these runners for the duration of the incident. Once the incident was resolved, retrying should have succeeded.

Root Cause

This was the result of a foreign key constraint being introduced before the referenced copy of the CI Runners table (ci_runners_e59bb2812d) had been backfilled, so inserts could attempt to reference records that did not yet exist in the production database. Although the tables in question are partitioned copies of the real tables and not yet in use, they are kept in sync by database triggers, which is what surfaced the problem. Any runner created, contacted, or otherwise modified after gitlab-org/gitlab!166308 (merged) was deployed to production (30 October 2024 at 00:07:21 UTC) would not be affected, since the trigger would have copied the updated record into the new ci_runners_e59bb2812d table.
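The broken state can be sketched in SQL. This is an illustrative simplification (column lists and constraint definitions are assumptions, not the actual GitLab schema or migration code):

```sql
-- Illustrative sketch only: simplified columns; the real tables differ.

-- New partitioned copy of ci_runners, not yet backfilled (empty).
CREATE TABLE ci_runners_e59bb2812d (
    id bigint NOT NULL,
    runner_type smallint NOT NULL,
    PRIMARY KEY (id, runner_type)
) PARTITION BY LIST (runner_type);

CREATE TABLE group_type_ci_runners_e59bb2812d
    PARTITION OF ci_runners_e59bb2812d FOR VALUES IN (2);

-- New partitioned copy of ci_runner_machines, with a foreign key into
-- the still-empty runners copy above.
CREATE TABLE ci_runner_machines_687967fa8a (
    id bigint NOT NULL,
    runner_id bigint NOT NULL,
    runner_type smallint NOT NULL,
    PRIMARY KEY (id, runner_type),
    FOREIGN KEY (runner_id, runner_type)
        REFERENCES ci_runners_e59bb2812d (id, runner_type)
) PARTITION BY LIST (runner_type);

CREATE TABLE group_type_ci_runner_machines_687967fa8a
    PARTITION OF ci_runner_machines_687967fa8a FOR VALUES IN (2);

-- A sync trigger copies every write on the original ci_runner_machines
-- table into the copy, so any insert fails with an FK violation whenever
-- the referenced runner was never copied into ci_runners_e59bb2812d.
```

Because the trigger runs inside the same transaction as the original insert, the FK violation on the copy table rolled back the write to the real table, which is why the API returned 500s.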

Delays in Remediation

We attempted to fix this issue by deploying a revert MR that ran a Rails migration to drop the failing foreign key constraint; however, this did not have the desired effect. Dropping the FK constraint from the parent routing table did not cause the FKs on the child partitions to be dropped.

This led the team to manually drop the foreign key constraint from the ci database via a Change Request, following a review process to minimise the risks of that approach, rather than waiting for the follow-up MR to run another Rails migration.
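The manual mitigation amounted to dropping the constraint on the partition itself, not just on the routing table. A hedged sketch of the shape of that change (the actual Change Request steps may have differed):

```sql
-- Dropping the FK on the parent routing table alone was not enough,
-- because the partition carried its own independently created constraint:
ALTER TABLE ci_runner_machines_687967fa8a
    DROP CONSTRAINT IF EXISTS fk_rails_3f92913d27;

-- The partition-level constraint had to be dropped explicitly:
ALTER TABLE group_type_ci_runner_machines_687967fa8a
    DROP CONSTRAINT IF EXISTS fk_rails_3f92913d27;
```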

Details

Context for the Change

As part of gitlab-org/gitlab#460084 (closed) we are introducing sharding keys for our ci_runner_machines/ci_runners tables in preparation for Cells 1.0.

What Happened?

In gitlab-org/gitlab!168131 (merged) we created the ci_runner_machines_687967fa8a partitioned table. This table contained a foreign key constraint that relied on the existence of data in the ci_runners_e59bb2812d table. However, the MR to backfill data into ci_runners_e59bb2812d had not yet been merged.

This meant any time an attempt was made to create a new runner manager attached to a group runner, the following type of error occurred:

```
PG::ForeignKeyViolation: ERROR:  insert or update on table "group_type_ci_runner_machines_687967fa8a" violates foreign key constraint "fk_rails_3f92913d27"
DETAIL:  Key (runner_id, runner_type)=(39256895, 2) is not present in table "group_type_ci_runners_e59bb2812d".
CONTEXT:  SQL statement "INSERT INTO ci_runner_machines_687967fa8a ("id",
    "runner_id", "sharding_key_id", "created_at", "updated_at",
    "contacted_at", "creation_state", "executor_type", "runner_type",
    "config", "system_xid", "platform", "architecture", "revision",
    "ip_address", "version")
  VALUES (NEW."id", NEW."runner_id", NEW."sharding_key_id", NEW."created_at",
    NEW."updated_at", NEW."contacted_at", NEW."creation_state",
    NEW."executor_type", NEW."runner_type", NEW."config", NEW."system_xid",
    NEW."platform", NEW."architecture", NEW."revision", NEW."ip_address",
    NEW."version")"
PL/pgSQL function table_sync_function_e438f29263() line 24 at SQL statement
```

More Detail (based on this incident note):

  1. POST requests to /runners/verify tried to insert a record into the ci_runner_machines table
  2. The table_sync_function_e438f29263 sync trigger tried to insert the same record into group_type_ci_runner_machines_687967fa8a
  3. The fk_rails_3f92913d27 FK constraint check failed because the referenced record was missing from the group_type_ci_runners_e59bb2812d table

Details on initial failed migration:

We learned that the initial migration that introduced the FK constraints partially failed when running in production. It succeeded on the main database, where the referenced tables contain no data, so validation passed. When add_concurrent_foreign_key attempted the same on the ci database, it failed on the first partition (group_type_ci_runner_machines_687967fa8a) but left that FK in a NOT VALID state instead of dropping it atomically. The FKs for the remaining partitions were never created, which is why only group runners were affected.
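This failure mode hinges on a PostgreSQL detail: a constraint added with NOT VALID skips validation of existing rows, but is still enforced for all new writes. A sketch of the sequence, with illustrative names and syntax (the real helper's steps may differ):

```sql
-- Step 1: add the constraint without validating existing rows (fast, avoids
-- a long lock). This is roughly the pattern add_concurrent_foreign_key uses.
ALTER TABLE group_type_ci_runner_machines_687967fa8a
    ADD CONSTRAINT fk_rails_3f92913d27
    FOREIGN KEY (runner_id, runner_type)
    REFERENCES group_type_ci_runners_e59bb2812d (id, runner_type)
    NOT VALID;

-- Step 2: validation failed because the referenced table was not backfilled:
--   ALTER TABLE group_type_ci_runner_machines_687967fa8a
--       VALIDATE CONSTRAINT fk_rails_3f92913d27;  -- raised an error
--
-- The migration aborted here, but the NOT VALID constraint from step 1
-- remained in place, and NOT VALID constraints still reject new inserts.
```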

Why did the first fix not work?

The fix we initially attempted to deploy did not work as intended, leading to further delay in mitigation.

The fix removed the foreign key from the partitioned routing table, but the database engine did not cascade the change to the partitions themselves, so the issue persisted. At that point it was decided to manually remove the foreign keys from the ci database partitions to mitigate the incident, rather than waiting for the full deployment of the follow-up fix.

Implications for Self Managed and Dedicated

Since this problem stemmed from the state of the GitLab.com database, Self-Managed and Dedicated instances were not affected. Self-hosted runners were impacted only if they were registered against GitLab.com.

CI Runners Error Ratio

[Image: ci-runners Service Error Ratio metrics]

API Error Ratio

[Image: api Service Error Ratio metrics]

Outcomes/Corrective Actions

  1. Remove FKs from partitioned tables of ci_runner... (gitlab-org/gitlab#502433 - closed)
  2. Improve QA tests to more thoroughly cover runne... (gitlab-org/gitlab#502424 - closed)
  3. Avoid creating or updating runner managers when... (gitlab-org/gitlab#502431 - closed)
  4. Have add_concurrent_foreign_key clean up FK if ... (gitlab-org/gitlab#504255 - closed)

Not necessarily a corrective action, but a potential deployment process improvement raised off the back of this incident: delivery#20691 (closed)

Learning Opportunities

What went well?

  1. The MR that introduced the problem was identified quickly, and a revert MR was opened within about an hour.

What was difficult?

  1. No errors surfaced in staging; if they had, the deployment to production could have been blocked.
  2. It took almost 6 hours from merge to deployment for the revert MR to reach production.
  3. The initial revert showed structure.sql as removing the main FK plus the 3 partition FKs, but only the main one was actually deleted from the production database, which necessitated a second fix and further delayed resolution.
  4. The postgres.ai thin clones were out of date (latest copy: data state at 2024-10-29 11:41:44 UTC) due to the migration to PG 16, which made it harder to manually test scenarios because many migrations were missing from them.
  5. We retain 7 days of logs; by the time of writing this incident review, the logs had aged out, making final collation of impact more difficult.

Review Guidelines

This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.

For the person opening the Incident Review

  • Set the title to Incident Review: (Incident issue name)
  • Assign a Service::* label (most likely matching the one on the incident issue)
  • Set a Severity::* label which matches the incident
  • In the Key Information section, make sure to include a link to the incident issue
  • Find and Assign a DRI from the team which owns the service (check their slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
  • Announce the incident review in the incident channel on Slack. (The Slack channel was closed before the incident review was opened.)

For the assigned DRI

  • Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
  • If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
  • Create a few short sentences in the Summary section summarizing what happened (TL;DR)
  • Use the description section to write a few paragraphs explaining what happened
  • Link any corrective actions and describe any other actions or outcomes from the incident
  • Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
  • Add any appropriate labels based on the incident issue and discussions
  • Once discussion wraps up in the comments, summarize any takeaways in the details section
  • Close the review before the due date
Edited by Donna Alexandra