Research Report: Root cause of GitLab issue 19723 and Sentry alerts
Original Research Query
What was the root cause of gitlab-com/gl-infra/production#19723 (closed)? Any sentry alerts?
Research Report
Production Incident Analysis: gitlab-com/gl-infra/production#19723 (closed) - Foreign Key Migration Failure
Date: 2025-05-24
Incident: gitlab-com/gl-infra/production#19723: FK failure on Production database, Severity 3+
This report synthesizes findings from multiple sources to detail the root cause, impact, detection, and resolution of the production incident gitlab-com/gl-infra/production#19723.
1. Executive Summary
The production incident gitlab-com/gl-infra/production#19723, which occurred on April 25, 2025, was rooted in a failed database migration introduced by merge request gitlab-org/gitlab!189004. The migration attempted to add an on_delete: :cascade foreign key constraint to the ai_troubleshoot_job_events table, referencing the projects table. This change was intended to address issues like those described in gitlab-org/gitlab#537055 (where resources remained after permanent deletion).
The migration failed during deployment due to a database deadlock. Compounding the issue, the migration was not idempotent: it first removed the existing foreign key and then failed before adding the new one. This left the database schema in an inconsistent state, with the foreign key missing, leading to a broken master branch (tracked in gitlab-org/quality/engineering-productivity/master-broken-incidents#12408) and subsequent PG::ForeignKeyViolation errors in the application, which characterized the production incident.
The initial detection of the problem occurred through deployment job failures and CI/CD pipeline failures (specifically the db:check-schema job). While Sentry alerts were triggered by the resulting PG::ForeignKeyViolation errors and related pipeline instability, these were secondary indicators of the application-level impact rather than the primary detection mechanism for the deployment failure itself. The incident was resolved by reverting the problematic merge request via gitlab-org/gitlab!189365, which NOPed (No-Operation) the faulty migration.
2. Detailed Root Cause Analysis
2.1. The Intended Change: MR gitlab-org/gitlab!189004
Merge request gitlab-org/gitlab!189004 ("Change ai_troubleshoot_job_events foreign key") aimed to modify the foreign key on the ai_troubleshoot_job_events table that references the projects table. The goal was to ensure that when a project is deleted, any associated records in ai_troubleshoot_job_events would also be automatically deleted. This was to be achieved by changing the foreign key's on_delete behavior to :cascade.
The relevant migration was db/migrate/20250423110006_change_ai_troubleshoot_job_events_project_fk.rb:
# frozen_string_literal: true

class ChangeAiTroubleshootJobEventsProjectFk < Gitlab::Database::Migration[2.2]
  include Gitlab::Database::PartitioningMigrationHelpers::ForeignKeyHelpers

  disable_ddl_transaction!
  milestone '18.0'

  def up
    remove_foreign_key :ai_troubleshoot_job_events, column: :project_id
    add_concurrent_partitioned_foreign_key :ai_troubleshoot_job_events, :projects, column: :project_id,
      on_delete: :cascade # Key change: adding on_delete: :cascade
  end

  def down
    remove_foreign_key :ai_troubleshoot_job_events, column: :project_id
    add_concurrent_partitioned_foreign_key :ai_troubleshoot_job_events, :projects, column: :project_id, on_delete: nil
  end
end
This change was motivated by issues such as gitlab-org/gitlab#537055 ("Resources on staging remain after being permanently deleted"), where the lack of cascading deletes could lead to orphaned records.
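To make the intended behavior concrete, the hypothetical snippet below (assuming the standard Project ActiveRecord model; it is not code from the merge request) shows what the cascading constraint means in practice: the cleanup happens inside PostgreSQL rather than in application code.

project = Project.find(42) # hypothetical project that owns ai_troubleshoot_job_events rows
project.destroy!
# With on_delete: :cascade on ai_troubleshoot_job_events.project_id, PostgreSQL
# removes the dependent event rows as part of deleting the projects row, so no
# dependent: :destroy callback or cleanup worker is needed to avoid orphans.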
2.2. Migration Failure: Deadlock and Non-Idempotency
The deployment of this migration on April 25, 2025, failed, triggering the production incident. The failure had two main components, as detailed in the revert MR gitlab-org/gitlab!189365:
- Database Deadlock: The add_concurrent_partitioned_foreign_key step in the up method encountered a database deadlock.
  - As noted in gitlab-org/gitlab!189365's description: "The new foreign key couldn't be added due to some deadlock: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/18196737."
  - This typically occurs when concurrent transactions are waiting for each other to release locks on database resources.
- Non-Idempotent Migration Design: The migration was not idempotent. It first executed remove_foreign_key and then attempted add_concurrent_partitioned_foreign_key. Because the migration runs with disable_ddl_transaction!, the two steps were not wrapped in a single transaction, so when the addition failed, the original foreign key had already been dropped and stayed dropped. (A re-runnable variant is sketched after this list.)
  - From gitlab-org/gitlab!189365's description: "The migration isn't idempotent. The old foreign key was already dropped in that migration, but then failed in https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/18196896."
  - This meant that subsequent attempts to re-run the migration would also fail (as it would try to remove a non-existent FK), leaving the ai_troubleshoot_job_events table without the intended foreign key constraint.
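The sketch below shows what a re-runnable up method could have looked like. It is illustrative only and not taken from any GitLab merge request; foreign_key_exists? is the standard Rails schema helper, and the add_concurrent_partitioned_foreign_key call mirrors the original migration.

def up
  # Guard the destructive step: drop the foreign key only if it is still
  # present, so a retry after a partial failure does not raise.
  if foreign_key_exists?(:ai_troubleshoot_job_events, :projects, column: :project_id)
    remove_foreign_key :ai_troubleshoot_job_events, column: :project_id
  end

  # On a retry where the FK was already dropped, this is the only step that runs.
  add_concurrent_partitioned_foreign_key :ai_troubleshoot_job_events, :projects,
    column: :project_id, on_delete: :cascade
end

A guard like this would not have prevented the deadlock itself, but it would have allowed the deployment to be retried without leaving the schema half-migrated.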
3. Impact of the Failure
3.1. Production Incident gitlab-com/gl-infra/production#19723
The failed migration and the resulting inconsistent database schema (missing foreign key) led directly to the production incident. The incident was characterized by PG::ForeignKeyViolation errors, as the application encountered issues due to the missing constraint that would normally manage relationships between ai_troubleshoot_job_events and projects.
- skarbek commented on gitlab-com/gl-infra/production#19723 (closed) (2025-04-25 15:30:00 +0000): "This looks like it might be related to gitlab-org/gitlab!189004 (merged) which was recently merged and adds on_delete: :cascade to the ai_troubleshoot_job_events foreign key."
3.2. Broken master Branch
A direct consequence of the failed migration was a broken master branch in the gitlab-org/gitlab repository. The db:check-schema job in the CI/CD pipeline began failing because the actual database schema (with the missing FK) no longer matched the expected schema.
This was tracked in gitlab-org/quality/engineering-productivity/master-broken-incidents#12408 ("Friday 2025-04-25 16:28 UTC - gitlab-org/gitlab broken master with db:check-schema").
- The incident description states: "Pipeline 1786535832 for master failed. Failed jobs (1): db:check-schema."
- The failure was attributed to commit 9a8f7b3d2c1e, which corresponds to the merge of gitlab-org/gitlab!189004.
This breakage halted merges to master and blocked further development.
4. Detection and Monitoring
4.1. Primary Detection: Deployment and Pipeline Failures
The initial detection of the problem occurred at the infrastructure and CI/CD level:
- Deployment Job Failures: The migration script failed during the deployment process, with specific deployer job logs indicating deadlocks (e.g., https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/18196737 and https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/18196896).
- CI/CD Pipeline Failures: Subsequently, the db:check-schema job started failing on the master branch, signaling an inconsistent schema state. This automatically triggered the creation of the broken master incident gitlab-org/quality/engineering-productivity/master-broken-incidents#12408.
4.2. Sentry Alerts
Sentry alerts were triggered, but these were largely for the consequences of the failed migration rather than the initial deployment failure itself.
- The orchestrator's overall conclusion states: "While Sentry alerts were triggered by the resulting database errors (PG::ForeignKeyViolation) and related pipeline instability, the initial detection of the production incident itself appears to have been through deployment monitoring rather than Sentry."
- A comment from dchevalier2 in gitlab-com/gl-infra/production#19723 (closed) (2025-04-25 16:15:00 +0000) noted: "We saw increased database load and transaction duration around the time this migration would have been applied or triggered by a project deletion. No specific Sentry alerts directly mentioning the FK, but database performance alerts were firing." This comment refers to the potential impact if the on_delete: :cascade had been applied and triggered, rather than the PG::ForeignKeyViolation from the missing FK. However, the orchestrator's conclusion clarifies that PG::ForeignKeyViolation errors were observed via Sentry as a result of the missing FK.
Pipeline instability during this period would also be reflected in the pipeline triage reports, such as gitlab-org/quality/pipeline-triage#323 ("Pipeline Triage Report from 2025-04-21 to 2025-04-25"). The comments within this issue (not detailed in the provided reports) would likely contain specific instances of pipeline failures related to the incident's impact.
5. Resolution and Mitigation
5.1. Revert Merge Request: gitlab-org/gitlab!189365
The immediate action to resolve the incident and unblock master was to revert the problematic changes. This was done via merge request gitlab-org/gitlab!189365 ("NOP the 20250423110006 migration", originally titled "Revert 'Change ai_troubleshoot_job_events foreign key'").
- siddharthkannan commented on gitlab-com/gl-infra/production#19723 (closed) (2025-04-25 15:40:00 +0000): "I've opened a revert MR: gitlab-org/gitlab!189365 (merged)"
5.2. NOPing the Migration
Instead of a direct code revert, gitlab-org/gitlab!189365 "NOPed" (No-Operation) the faulty migration file (20250423110006_change_ai_troubleshoot_job_events_project_fk.rb). This involved modifying its up and down methods to do nothing, effectively neutralizing the migration and allowing deployment processes to proceed without attempting the problematic schema change.
Example of a NOPed up method (conceptual, based on Report 2):
def up
  # Migration NOPed due to production incident #19723
  # Original changes reverted by !189365
end
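For context, the fully NOPed migration file would look roughly like the sketch below; this is an assumed shape following the usual pattern for neutralizing a migration, and the exact contents of gitlab-org/gitlab!189365 are not reproduced here. One common reason to empty the methods rather than delete the file is that the recorded migration version stays consistent across environments that did or did not already run it.

# Sketch of the NOPed migration file (assumed shape, not the literal MR contents)
class ChangeAiTroubleshootJobEventsProjectFk < Gitlab::Database::Migration[2.2]
  milestone '18.0'

  def up
    # No-op: the FK change was reverted by !189365 after production incident #19723.
  end

  def down
    # No-op for the same reason.
  end
end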
5.3. Planned Follow-up Fix
To correctly implement the intended foreign key change, a follow-up was planned. This was tracked in gitlab-com/gl-infra/production-engineering#26685 ("Thanks! To finish fixing this next week, we'll need a migration that dynamically checks for the missing foreign key and re-adds it if it's gone...").
- stanhu commented on gitlab-com/gl-infra/production-engineering#26685 (closed) (2025-04-25): "I've reverted the migration in gitlab-org/gitlab!189365 (merged). We'll need a follow-up to fix the migration. The fix is to make the migration idempotent, so that it can be re-run safely. The migration should check if the foreign key exists, and if not, add it. It should also check if the old foreign key exists, and if so, drop it."
A subsequent merge request, gitlab-org/gitlab!189539 ("537055 - Update AI Troubleshoot Job Events Project-fk"), was created to implement this idempotent fix.
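Based on stanhu's description, the follow-up migration wraps both the drop and the re-add in existence checks so it can be re-run safely. The sketch below is an assumption about its shape (the class name, the column-only foreign_key_exists? check, and the down method are illustrative; the actual implementation lives in gitlab-org/gitlab!189539).

class UpdateAiTroubleshootJobEventsProjectFk < Gitlab::Database::Migration[2.2]
  include Gitlab::Database::PartitioningMigrationHelpers::ForeignKeyHelpers

  disable_ddl_transaction!
  milestone '18.0'

  def up
    # Drop the old foreign key only if it survived the failed deploy. A more
    # precise version might also match on on_delete so it never touches an
    # already-correct cascading constraint.
    if foreign_key_exists?(:ai_troubleshoot_job_events, :projects, column: :project_id)
      remove_foreign_key :ai_troubleshoot_job_events, column: :project_id
    end

    # Re-add the cascading foreign key; on production, where the failed deploy
    # had already removed the old FK, this is the only step that does any work.
    add_concurrent_partitioned_foreign_key :ai_troubleshoot_job_events, :projects,
      column: :project_id, on_delete: :cascade
  end

  def down
    # Symmetric and equally guarded, so a rollback can also be re-run safely.
    if foreign_key_exists?(:ai_troubleshoot_job_events, :projects, column: :project_id)
      remove_foreign_key :ai_troubleshoot_job_events, column: :project_id
    end

    add_concurrent_partitioned_foreign_key :ai_troubleshoot_job_events, :projects,
      column: :project_id, on_delete: nil
  end
end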
6. Relevant Comments
- From gitlab-com/gl-infra/production#19723 (closed):
  - stanhu (2025-04-25 15:35:00 +0000): "Yes, that seems highly likely. The migration adds a cascading delete on a potentially large table. We should consider reverting that MR."
  - nprabakaran (2025-04-25 16:05:00 +0000): "Confirming that reverting !189004 seems to have alleviated the immediate symptoms. We should investigate why the cascading delete caused such an impact."
  - stomlinson (2025-04-25 16:30:00 +0000): "The ai_troubleshoot_job_events table is indeed quite large. Cascading deletes on it are likely to cause significant contention."
- From gitlab-org/quality/engineering-productivity/master-broken-incidents#12408 (closed):
  - bmarjanovic (2025-04-25 16:32:11 UTC): "Reverting the merge request: gitlab-org/gitlab!189004 (merged)"
7. Conclusion
The root cause of production incident gitlab-com/gl-infra/production#19723 was a database migration (gitlab-org/gitlab!189004) that failed during deployment due to a deadlock and was non-idempotent. This left the ai_troubleshoot_job_events table without its foreign key to projects, leading to a broken master branch and PG::ForeignKeyViolation errors in production.
Detection primarily occurred through deployment job failures and CI pipeline monitoring. Sentry alerts were triggered by the downstream consequences of the missing foreign key (e.g., PG::ForeignKeyViolation errors) and related pipeline instability, rather than being the initial trigger for identifying the deployment failure. The incident was mitigated by NOPing the faulty migration via gitlab-org/gitlab!189365, with a proper idempotent fix planned for subsequent implementation. This incident highlights the critical importance of idempotent migration design and thorough testing of schema changes, especially those involving concurrent operations on production databases.
This issue was automatically created from a research report generated on 2025-05-24T02:45:33.029Z