2021-05-18 Staging QA Failures from Feature Flag enablement blocking auto-deploy

Current Status

Timeline

View recent production deployment and configuration events (internal only)

All times UTC.

2021-05-18

11:57 - Feature flags were changed
13:15 - QA begins on staging on package 13.12.202105181120-33114d58a28.47f613a7e2d
13:25 - QA fails - https://ops.gitlab.net/gitlab-org/quality/staging/-/jobs/3889664
13:29 - QA is retried (automated) - but fails again
13:40 - QA is retried (automated) - but fails again
15:33 - Release Manager reaches out to QA on-call
16:22 - @skarbek declares incident in Slack.
16:53 - It was discovered that the prior package 13.12-64cc235edba showed similar behavior AFTER a feature flag was swapped
18:06 - Attempts to reverse the feature flag failed due to #4632 (closed)
18:52 - The feature flags are reversed
19:07 - QA is noted to have passed
20:56 - Release Manager confirms staging is in a healthy state and resumes Auto-Deployment activities

Corrective Actions

Corrective actions should be put here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.

QA runs should be executed after feature flag changes, so that any failures those changes introduce are caught.
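One way to picture this corrective action is a small hook that turns flag changes into QA runs and records which flags each run covers, so that a failure can be traced back to a flag quickly. This is a minimal sketch only: `FlagChange`, `qa_runs_for`, and the `run_qa` callback are hypothetical names introduced for illustration, not part of chatops or the QA pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class FlagChange:
    """Hypothetical record of one feature flag change."""
    flag: str          # e.g. "sync_traversal_ids"
    environment: str   # e.g. "staging"


def qa_runs_for(
    changes: Iterable[FlagChange],
    run_qa: Callable[[str, List[str]], None],
    environments: Tuple[str, ...] = ("staging",),
) -> List[Tuple[str, List[str]]]:
    """Trigger one QA run per watched environment that had flag changes.

    `run_qa` is an assumed callback that starts a QA run; it receives the
    environment name and the sorted list of flags changed there.
    """
    changes = list(changes)
    triggered = []
    for env in environments:
        flags = sorted({c.flag for c in changes if c.environment == env})
        if flags:
            run_qa(env, flags)
            triggered.append((env, flags))
    return triggered


# Usage: record the calls instead of starting a real pipeline.
calls = []
triggered = qa_runs_for(
    [FlagChange("sync_traversal_ids", "staging")],
    run_qa=lambda env, flags: calls.append((env, flags)),
)
```

Batching the changed flags per environment keeps this to one QA run per environment while still naming every flag that could be responsible.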

Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.

Incident Review

Summary

Feature flags were changed, and QA then began to fail on staging.

Service(s) affected: ~"Service::GitLab Rails"
Team attribution: ~"group::access"
Time to detection: 88 minutes
Minutes downtime or degradation: 8 hours
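
The 88-minute detection time matches the timeline if it is measured from the feature flag change (11:57) to the first QA failure (13:25). That choice of endpoints is an assumption, as are the endpoints for the other intervals below, since the report does not state which events the figures are measured between. A short sketch of the arithmetic, using only timestamps from the timeline:

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the timeline for 2021-05-18.
flags_changed = "11:57"
first_qa_failure = "13:25"
qa_passed = "19:07"
auto_deploy_resumed = "20:56"

detection = minutes_between(flags_changed, first_qa_failure)
failure_to_pass = minutes_between(first_qa_failure, qa_passed)
change_to_resume = minutes_between(flags_changed, auto_deploy_resumed)

print(detection)         # 88
print(failure_to_pass)   # 342 (5 h 42 min)
print(change_to_resume)  # 539 (8 h 59 min)
```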

Metrics

Customer Impact

Who was impacted by this incident? (i.e. external customers, internal customers)
1. Internal
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
1. QA Failures
How many customers were affected?
1. 0
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
1. Failed Requests

What were the root causes?

The use_traversal_ids feature flag relies on traversal_ids being synced, which is controlled by another feature flag, sync_traversal_ids. We enabled sync_traversal_ids for the gitlab-org group via chatops and, after verifying the behavior, switched default_enabled to true in the feature flag's YAML definition.

We expected sync_traversal_ids to then be enabled by default. What we missed is that enabling a feature flag for a single group via chatops leaves the flag globally disabled, and that this stored state overrides the YAML definition.

As a result, enabling use_traversal_ids for gitlab-qa-sandbox-group did not work as expected: sync_traversal_ids was still off for that group, and this caused the pipeline failures.

Running /chatops run feature delete sync_traversal_ids --staging fixed the issue, and we were able to re-enable the use_traversal_ids feature flag.
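
The precedence described above can be illustrated with a toy model. It is a sketch of the behaviour as reported in this incident, not GitLab's actual implementation: the `FlagStore` class and its methods are hypothetical, and the only rule it encodes is that a stored per-group setting takes priority over the YAML default until the stored state is deleted.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FlagStore:
    """Toy model: stored flag state wins over the YAML default."""
    yaml_defaults: Dict[str, bool]
    # flag name -> groups explicitly enabled (assumed storage shape)
    stored: Dict[str, Set[str]] = field(default_factory=dict)

    def enable_for(self, flag: str, group: str) -> None:
        # Enabling for one group creates stored state for the whole flag.
        self.stored.setdefault(flag, set()).add(group)

    def delete(self, flag: str) -> None:
        # Removing stored state lets the YAML default apply again.
        self.stored.pop(flag, None)

    def enabled(self, flag: str, group: str) -> bool:
        if flag in self.stored:
            # Stored state exists: only listed groups are on; the YAML
            # default is ignored for everyone else.
            return group in self.stored[flag]
        return self.yaml_defaults.get(flag, False)


# default_enabled is true in YAML, but the flag was also enabled for one group.
store = FlagStore(yaml_defaults={"sync_traversal_ids": True})
store.enable_for("sync_traversal_ids", "gitlab-org")

targeted = store.enabled("sync_traversal_ids", "gitlab-org")
others_before = store.enabled("sync_traversal_ids", "gitlab-qa-sandbox-group")

store.delete("sync_traversal_ids")
others_after = store.enabled("sync_traversal_ids", "gitlab-qa-sandbox-group")

print(targeted, others_before, others_after)  # True False True
```

In this model, gitlab-qa-sandbox-group only picks up the YAML default once the stored state is deleted, which mirrors the effect of the feature delete command.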

Incident Response Analysis

How was the incident detected?
1. QA Failures reported by Deployer
How could detection time be improved?
1. Run QA after a feature flag is modified, so that failures caused by the change surface right away.
How was the root cause diagnosed?
1. I could not replicate it locally, but I ran the failing specs against the staging environment with various feature flag settings.
How could time to diagnosis be improved?
1. TODO
How did we reach the point where we knew how to mitigate the impact?
1. TODO
How could time to mitigation be improved?
1. TODO
What went well?
1. TODO

Post Incident Analysis

Did we have other events in the past with the same root cause?
1. Yes: https://gitlab.com/gitlab-com/gl-infra/production/-/issues?scope=all&utf8=%E2%9C%93&state=closed&label_name[]=RootCause%3A%3AFeature-Flag
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
1. TODO
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
1. Feature Flag for code: gitlab-org/gitlab!56296 (merged)

Lessons Learned

The interaction between chatops and the YAML definition of a feature flag needs clarification.

Guidelines

Blameless RCA Guideline

Resources

If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).

Edited Jun 04, 2021 by John Skarbek