2020-11-05: Merge requests failing with "Conflicts detected during merge. Please try again"
Summary
We've received multiple reports of users being unable to merge via the UI, receiving the error Merge failed: Conflicts detected during merge. Please try again.
Merging manually through the command line appears to still be possible.
Zendesk Reports (GitLab Internal)
- https://gitlab.zendesk.com/agent/tickets/179504
- https://gitlab.zendesk.com/agent/tickets/179475
- https://gitlab.zendesk.com/agent/tickets/179471
- https://gitlab.zendesk.com/agent/tickets/179521
Timeline
All times UTC.
2020-11-05
- 08:54 - @pks-t sets
gitaly_go_user_merge_branch
feature flag - 18:43 - tristan declares incident in Slack.
- 20:04 - @stanhu identified a temporally correlated feature flag change,
gitaly_go_user_squash
Slack - 20:06 - @AnthonySandoval deletes the feature flag
gitaly_go_user_squash
Slack - 20:13 - @AnthonySandoval deletes the feature flag
gitaly_go_user_merge_to_ref
Slack - 20:23 - @AnthonySandoval deletes the feature flag
gitaly_go_list_conflict_files
Slack - 20:25 - @anthonysandoval deletes the feature flag
gitaly_go_user_merge_branch
Slack. This mitigated the issue.
Click to expand or collapse the Incident Review section.
Incident Review
Summary
For 11 hours and 31 minutes (between 2020-11-05 08:54 UTC and 2020-11-05 20:25 UTC) users with protected branches were unable to perform merges via the Web UI, due to a buggy feature being enabled during that period. Disabling that feature resolved the incident.
- Service(s) affected: ServiceGitaly
- Team attribution: Gitaly
- Minutes downtime or degradation: ~11 hours and 31 minutes
Metrics
https://log.gprd.gitlab.net/goto/6fd84d8585ec4af9090483e3b5b070c7
Customer Impact
- Who was impacted by this incident? Users attempting to merge MRs on protected branches.
- What was the customer experience during the incident? Users received a permission error.
- How many customers were affected? 288 MR attempts
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected? Escalation from Customer Support based on user reports
- How could detection time be improved? We don't have alerting for these errors since they're usually legitimate user errors of users trying to perform merges disabled by protected branches settings. Even if we did alert on, say, the rate of these errors, most likely this incident wouldn't have triggered such an alert anyways, since it had a very small impact on the overall error rate given the boundaries of the issue.
- How did we reach the point where we knew how to mitigate the impact? @stanhu pointed out feature flags turned on earlier that day, that were potentially related to the incident based on the error messages we were seeing.
- How could time to mitigation be improved? Suspicions were initially directed at code changes emanating from a recent deploy, until attention was driven to the recent feature flag changes.
Post Incident Analysis
- How was the root cause diagnosed? Given multiple gitaly-related feature flag changes occurred on the same day of the incident, the approach used was trial and error by disabling those feature flags one by one and observing if errors kept flowing in after the change. Once the problematic feature flag was identified, the backend fix was investigated in gitlab-org/gitaly#3271 (closed) without need of additional production data
- How could time to diagnosis be improved? Bringing the feature flag changes to attention could've happened sooner
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? No
- Was this incident triggered by a change (deployment of code or change to infrastructure. If yes, have you linked the issue which represents the change?)? Yes: gitlab-org/gitaly!2540 (merged)
5 Whys
- Users with protected branches where unable to merge via the Web UI, why?
- The implementation of
user_merge_branch
in Gitaly had a bug where theGL_PROTOCOL
variable wasn't correctly set in the git hooks, leading to a false positive validation ofCODEOWNERS
permissions
- Why did this bug not get noticed in unit tests in gitlab-rails?
-
The spec for
DiffCheck
does not validate the codeowners code path. An example for that path should be added, though such an example would have as a precondition that theGL_PROTOCOL
is set correctly so it wouldn't have caught the error that caused this incident.
- Why did this bug not get noticed in unit tests in gitaly?
-
The tests for
UserMergeBranch
did not check whetherGL_PROTOCOL
was set correctly. That test was added, together with the fix, on gitlab-org/gitaly!2753 (merged)
- Why did this bug not get noticed in our E2E testing?
- The integration test for this use case is missing. The existing merge request E2E tests do not consider protected branches restrictions.
Lessons Learned
- When a bug generates an "expected" user validation error, instead of an uncaught error, it's hard to detect since we intentionally don't alert on such errors.
- Our chatops commands for feature flags could be refined to require more context for each change.
- Our E2E testing has a blind spot regarding merge requests with
CODEOWNERS
Corrective Actions
- chatops should require an issue link with more documentation explaining the feature flag that's being set/updated and a link to the relevant code. gitlab-com/chatops#97 (closed)
- Some training for EOC/IMOC should be documented on how best to rollback feature flags: do we
set false
orset 0
or is adelete
sufficient https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11977 -
/chatops run feature delete
commands should be logged in the feature-flag-log. - Add an E2E test case for Merge Requests with
CODEOWNERS
gitlab-org/quality/testcases#1090 (closed) - Enforce parameter types in internal/allowed endpoint - gitlab-org/gitaly#3255 (closed)
- Improve input validation of hook inputs - gitlab-org/gitaly#3255 (closed)
Guidelines
Edited by Alejandro Rodríguez