2021-05-18 Staging QA Failures from Feature Flag enablement blocking auto-deploy
Current Status
Timeline
View recent production deployment and configuration events (internal only)
All times UTC.
2021-05-18
- 11:57 - Feature flags were changed
- 13:15 - QA begins on staging on package 13.12.202105181120-33114d58a28.47f613a7e2d
- 13:25 - QA fails - https://ops.gitlab.net/gitlab-org/quality/staging/-/jobs/3889664
- 13:29 - QA is retried (automated) - but fails again
- 13:40 - QA is retried (automated) - but fails again
- 15:33 - Release Manager reaches out to QA on-call
- 16:22 - @skarbek declares incident in Slack.
- 16:53 - It was discovered that the prior package 13.12-64cc235edba showed similar behavior AFTER a feature flag was swapped
- 18:06 - Attempts to reverse the feature flag failed to change due to #4632 (closed)
- 18:52 - The feature flags are reversed
- 19:07 - QA is noted to have passed
- 20:56 - Release Manager confirms staging is in a healthy state and resumes Auto-Deployment activities
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated, ensure that all corrective actions mentioned in the notes below are included.
- QA runs should be executed after feature flag changes so that any failures those changes cause are caught promptly
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
Feature flags were changed:
- https://gitlab.com/gitlab-com/gl-infra/feature-flag-log/-/issues/5522
- https://gitlab.com/gitlab-com/gl-infra/feature-flag-log/-/issues/5523
And QA began to fail on Staging.
- Service(s) affected: GitLab Rails
- Team attribution: ~"group::access"
- Time to detection: 88 minutes
- Minutes downtime or degradation: 8 hours
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Internal
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - QA Failures
- How many customers were affected?
  - 0
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Failed Requests
What were the root causes?
The use_traversal_ids feature flag relies on having traversal_ids synced, which is controlled by another feature flag: sync_traversal_ids. We enabled sync_traversal_ids for the gitlab-org group via chatops, and after verifying the behavior, we switched default_enabled to true in the YAML definition of the feature flag.
Our expectation was that sync_traversal_ids would then be enabled by default. What we missed is that enabling the feature flag for a single group in chatops sets the feature flag to globally disabled and overrides the YAML definition.
So enabling use_traversal_ids for gitlab-qa-sandbox-group did not work as expected, because sync_traversal_ids for that group was still off, and this caused the pipeline failures.
Running /chatops run feature delete sync_traversal_ids --staging fixed the issue, and we were able to re-enable the use_traversal_ids feature flag.
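The precedence described above can be illustrated with a small toy model. This is only a sketch of the behavior as this report describes it, not GitLab's actual feature flag implementation: the FlagStore and StoredFlag classes and their methods are hypothetical names invented for illustration. The key assumption is that once a flag has any persisted state (for example, from a per-group chatops enable), the YAML default is no longer consulted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class StoredFlag:
    # Hypothetical persisted record, created the first time chatops touches a flag.
    globally_enabled: bool = False
    enabled_groups: Set[str] = field(default_factory=set)


class FlagStore:
    """Toy model: persisted state, if present, wins over the YAML default."""

    def __init__(self, yaml_defaults: Dict[str, bool]) -> None:
        self.yaml_defaults = dict(yaml_defaults)
        self.stored: Dict[str, StoredFlag] = {}

    def enable_for_group(self, flag: str, group: str) -> None:
        # Creates a persisted record whose global state defaults to disabled.
        self.stored.setdefault(flag, StoredFlag()).enabled_groups.add(group)

    def delete(self, flag: str) -> None:
        # Models "feature delete": drop persisted state, fall back to YAML.
        self.stored.pop(flag, None)

    def enabled(self, flag: str, group: Optional[str] = None) -> bool:
        record = self.stored.get(flag)
        if record is None:
            return self.yaml_defaults.get(flag, False)
        return record.globally_enabled or group in record.enabled_groups


store = FlagStore({"sync_traversal_ids": False})
store.enable_for_group("sync_traversal_ids", "gitlab-org")   # chatops, single group
store.yaml_defaults["sync_traversal_ids"] = True             # default_enabled: true

print(store.enabled("sync_traversal_ids", "gitlab-org"))               # True
print(store.enabled("sync_traversal_ids", "gitlab-qa-sandbox-group"))  # False

store.delete("sync_traversal_ids")                           # the fix used here
print(store.enabled("sync_traversal_ids", "gitlab-qa-sandbox-group"))  # True
```

In this model, the QA sandbox group only picks up the YAML default after the persisted record is deleted, which mirrors the effect of the feature delete command used during the incident.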
Incident Response Analysis
- How was the incident detected?
  - QA Failures reported by Deployer
- How could detection time be improved?
  - QA Failures that happen after a feature flag is modified
- How was the root cause diagnosed?
  - I could not replicate it locally, but I ran the failing specs against the staging environment with various feature flag settings.
- How could time to diagnosis be improved?
  - TODO
- How did we reach the point where we knew how to mitigate the impact?
  - TODO
- How could time to mitigation be improved?
  - TODO
- What went well?
  - TODO
Post Incident Analysis
- Did we have other events in the past with the same root cause?
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - TODO
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Feature Flag for code: gitlab-org/gitlab!56296 (merged)
Lessons Learned
- Interaction between chatops and YAML definition of the feature flag needs some clarification.
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)