2024-10-31: Runner verification API returning 500
Customer Impact
From 04:00 UTC to 22:05 UTC on Thursday, Oct 31, an average of 0.2% of runners were impacted by this incident.
Requests to the POST /runners/verify endpoint returned 500 errors.
This caused problems starting new runners and existing runners that had not been active in the past 7 days.
Current Status
We were seeing elevated error rates on the POST /runners/verify endpoint, which seem to be related to a DB migration introduced in gitlab-org/gitlab!168131.
As of 2024-11-01 09:52 UTC:
- The first revert MR was deployed, but we found we needed another change: removing the FK constraint from the parent routing table unexpectedly did not remove it from the child partition tables.
- The second FK constraint removal MR was created and is in review.
- A Change Request was completed to manually remove the FK constraints in the affected ci table and mitigate customer impact.
- There is a scheduled Postgres upgrade this weekend with a Production Change Lock, meaning:
  - The MR to remove the FK constraint will be deployed on Tuesday, Nov 05.
  - The post deploy migration to apply this change will run on Wednesday, Nov 06.
  - It has been determined that this poses no risk to the Postgres upgrade because this was limited to the ci database, and the upgrade is scheduled on the main database.
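The partition issue described above (dropping the FK constraint on the parent routing table did not drop it on the child partitions) can be illustrated with a minimal sketch that builds an explicit DROP CONSTRAINT statement for the parent and for every partition. This is not the migration or Change Request that was actually run. The table names, partition names, and constraint name are hypothetical placeholders, and the sketch assumes the constraint has the same name on every partition, which may not hold in practice.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def drop_fk_statements(parent: str, partitions: List[str], constraint: str) -> List[str]:
    """Build one DROP CONSTRAINT statement for the parent and for each partition.

    IF EXISTS keeps each statement safe to re-run when the constraint is already gone.
    """
    tables = [parent] + list(partitions)
    return [
        f"ALTER TABLE {quote_ident(t)} DROP CONSTRAINT IF EXISTS {quote_ident(constraint)};"
        for t in tables
    ]


# Hypothetical names, for illustration only.
statements = drop_fk_statements(
    parent="routing_table",
    partitions=["routing_table_p1", "routing_table_p2"],
    constraint="fk_example",
)
for statement in statements:
    print(statement)
```

Listing every partition explicitly avoids relying on the parent-level change propagating to the children, which is the assumption that failed here.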
📝 Summary for CMOC notice / Exec summary:
- Customer Impact: roughly 550 top-level namespaces were impacted. The error ratio for ci-runners was roughly 0.2% during the period of this incident.
- Service Impact: Runners API, specifically any new runners and existing runners that haven't been active for the past 7 days and dropped out of the DB.
- Impact Duration: 04:00 UTC-22:05 UTC
- Root cause: Migrations merged (and therefore run) out of order. See @pedropombeiro's discussion: #18792 (comment 2187473313)
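To make the root cause concrete, here is a minimal sketch of a check that flags migrations run before a migration they depend on. It is an illustration only, not GitLab's tooling: the function, the dependency map, and the migration names are all hypothetical.

```python
from typing import Dict, List, Tuple


def find_order_violations(
    run_order: List[str], depends_on: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Return (migration, prerequisite) pairs where the prerequisite did not run first."""
    position = {name: i for i, name in enumerate(run_order)}
    violations = []
    for migration, prerequisites in depends_on.items():
        if migration not in position:
            continue  # this migration has not run, so there is nothing to check
        for prereq in prerequisites:
            # A prerequisite that never ran, or ran later, is a violation.
            if prereq not in position or position[prereq] > position[migration]:
                violations.append((migration, prereq))
    return violations


# Hypothetical example: a constraint migration ran before the backfill it depends on.
run_order = ["add_constraint", "backfill_table"]
depends_on = {"add_constraint": ["backfill_table"]}
violations = find_order_violations(run_order, depends_on)
print(violations)
```

A check like this only helps if dependencies between migrations are declared somewhere, which is itself an assumption of the sketch.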
Steps to resolve once incident is closed
- Wait until Backfill ci_runners_e59bb2812d table (gitlab-org/gitlab!166520 - merged) is merged and deployed.
- Then unrevert the migration that was reverted to mitigate the incident gitlab-org/gitlab!171246 (merged)
📚 References and helpful links
Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | GitLab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks, Hot Patching or speeding up deployments. | Rollback Runbook | Hot Patch Runbook
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue that is only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Security Note: If anything abnormal is found during the course of your investigation, please do not hesitate to contact security.