Approved MR blocked 4 days by unrelated CI failures — data points and process questions
## Summary

[!225247](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/225247) (group secret rotation model, part of [#577344](https://gitlab.com/gitlab-org/gitlab/-/issues/577344)) had all 5 required approvals by **Mar 5**. It merged **Mar 9** — 4 calendar days later — after 7 failed pipelines and 52 failed jobs. None of the failures were caused by the MR itself. Raising this as a concrete data point to help improve our processes. I know everyone is working hard on this.

## What failed

### 1. Schema migration mismatch from another MR

`db:check-migrations` failed in 4 pipelines. [!222342](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/222342) updated a migration timestamp but missed pushing the corresponding `db/schema_migrations` file. Fixed by [!226188](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/226188) (merged Mar 6 13:16 UTC, ~2 days after our approvals were in). Tracked in pipeline incident [#15355](https://gitlab.com/gitlab-org/quality/engineering-productivity/approved-mr-pipeline-incidents/-/issues/15355).

### 2. CI infra / environment setup failures

Pipeline incident [#15590](https://gitlab.com/gitlab-org/quality/engineering-productivity/approved-mr-pipeline-incidents/-/issues/15590) shows a pipeline where 16 jobs (clone, cache, db setup, build, etc.) all failed at once. The bot retried it twice; it failed both times. This is CI infrastructure trouble, not MR logic. Separately, `gdk-update` jobs failed repeatedly with `fatal: remote error: GitLab is currently unable to handle this request due to load`.

### 3. Flaky system spec

Master-broken incidents [#720](https://gitlab.com/gitlab-org/quality/analytics/master-broken-incidents/-/issues/720) and [#721](https://gitlab.com/gitlab-org/quality/analytics/master-broken-incidents/-/issues/721) show `spec/features/work_items/issues/new/user_creates_issue_spec.rb` blocking 50-56 pipelines and 36-41 MRs. The DevEx single-test dashboard shows a ~29% pass rate.
The culprit was [!225585](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/225585), fixed by [!226361](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/226361) (forward fix merged Mar 7 00:21 UTC; revert [!226296](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/226296) was prepared but closed in favor of the fix).

### 4. Approval reset on rebase

To pick up the fixes above, we rebased on Mar 6. All 5 approvals were reset. A pipeline on Mar 7 ([#2370151322](https://gitlab.com/gitlab-org/gitlab/-/pipelines/2370151322)) actually **passed green**, but the MR could not merge because approvals had not been restored. By the time reviewers re-approved across timezones and a new pipeline ran (Mar 8), that pipeline hit the 16-job infra meltdown ([#15590](https://gitlab.com/gitlab-org/quality/engineering-productivity/approved-mr-pipeline-incidents/-/issues/15590)), adding another day. This is a known pain point tracked under epic [&544 - Smarter approval resets](https://gitlab.com/groups/gitlab-org/-/work_items/544) (open since 2018).

## Impact

- **4 calendar days** from approval to merge
- **~10.5 hours** of CI compute burned across 7 pipelines
- **Multiple engineers** spent time investigating, retrying, and waiting instead of moving on
- **4 dependent MRs** in the group secrets rotation series ([!225338](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/225338), [!225562](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/225562), [!225935](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/225935), [!226133](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/226133)) were blocked

## Questions

### Fast-tracking approved MRs

When an approved MR has (a) an open pipeline incident pointing at another MR, (b) an open master-broken or master-flaky incident for the same job/test, and (c) dashboards showing many affected pipelines/MRs, is there a way to fast-track already-approved MRs?
For example, a documented "safe to merge" path, or a supported bypass of the known-broken job, instead of asking teams to keep retrying and hoping?

### Master-breaking spec response time

For master-breaking specs that block dozens of pipelines and MRs within hours, what is the target response time to (1) identify the culprit MR, (2) fix or revert, and (3) quarantine the test if needed? As far as I know, the plan is to default to fast-quarantine once impact crosses a threshold.

### Infrastructure visibility

Between Mar 5-8 we saw repeated `fatal: ... unable to handle this request due to load` errors and the 16-job meltdown in [#15590](https://gitlab.com/gitlab-org/quality/engineering-productivity/approved-mr-pipeline-incidents/-/issues/15590). Were there known capacity or infra incidents in that window? Is there a way to surface that more clearly on affected pipelines, so maintainers can make a safer call on merging approved MRs once infra stabilizes?

### Approval reset epic

Are there any plans to revive [&544 - Smarter approval resets](https://gitlab.com/groups/gitlab-org/-/work_items/544) (open since 2018, due date Mar 13 2026)?
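### Aside: catching the schema-migration mismatch locally

Failure mode 1 above (a migration timestamp updated without its `db/schema_migrations` file) could in principle be caught before push, not in CI. A minimal sketch, assuming the convention that each `db/migrate/<timestamp>_<name>.rb` (and `db/post_migrate/` equivalent) must have a matching marker file named `db/schema_migrations/<timestamp>`; the function name and exact layout here are my own, not existing tooling:

```shell
# Sketch of a pre-push guard: verify every migration's version timestamp
# has a matching marker file under db/schema_migrations/.
# Assumed layout: db/migrate/<timestamp>_<name>.rb plus an empty
# db/schema_migrations/<timestamp> per migration.
check_schema_migrations() {
  status=0
  for f in db/migrate/*.rb db/post_migrate/*.rb; do
    [ -e "$f" ] || continue                      # glob matched nothing
    ts=$(basename "$f" | cut -d_ -f1)            # leading version timestamp
    if [ ! -e "db/schema_migrations/$ts" ]; then
      echo "missing db/schema_migrations/$ts for $f" >&2
      status=1
    fi
  done
  return $status
}
```

Run as a pre-push hook (or folded into whatever the existing `db:check-migrations` tooling already does locally), something like this would have flagged the mismatch in the culprit MR before it ever reached other MRs' pipelines.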
## Related incidents and references

- Approved MR pipeline incidents: [#15355](https://gitlab.com/gitlab-org/quality/engineering-productivity/approved-mr-pipeline-incidents/-/issues/15355), [#15550](https://gitlab.com/gitlab-org/quality/engineering-productivity/approved-mr-pipeline-incidents/-/issues/15550), [#15590](https://gitlab.com/gitlab-org/quality/engineering-productivity/approved-mr-pipeline-incidents/-/issues/15590)
- Master-broken incidents: [#720](https://gitlab.com/gitlab-org/quality/analytics/master-broken-incidents/-/issues/720), [#721](https://gitlab.com/gitlab-org/quality/analytics/master-broken-incidents/-/issues/721)
- Flaky spec culprit: [!225585](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/225585), fix [!226361](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/226361)
- Approval reset epic: [&544](https://gitlab.com/groups/gitlab-org/-/work_items/544)

/cc @ddieulivol @pjphillips @m_gill @grzesiek @fcatteau @amyphillips @andrewn