RCA - Sidekiq shard_urgent_cpu_bound saturated after enabling multi_pipeline_scan_result_policies feature flag
Summary
Related incident: 2022-12-14: Sidekiq shard_urgent_cpu_bound satu... (gitlab-com/gl-infra/production#8159 - closed)
After enabling the `multi_pipeline_scan_result_policies` feature flag, we noticed delays in background processing for approximately 45 minutes that impacted both Web and API. The primary impact was slower processing of CI jobs, issues, merge requests, and notes.
The identified root cause was the feature-flag rollout: [Feature flag] Rollout of `multi_pipeline_scan_...` (#382990 - closed).
Service(s) affected : Web, API, Sidekiq
Team attribution: gitlab-com/gl-infra/reliability~12612686
Minutes downtime or degradation: 2022-12-14 10:00 UTC - 10:45 UTC (45 min)
Impact & Metrics
| Question | Answer |
|---|---|
| What was the impact | Delays in background processing that impacted both Web and API |
| Who was impacted | External and internal customers |
| How did this impact customers | Slower processing of CI jobs, issues, merge requests, and notes |
| How many attempts made to access | - |
| How many customers affected | - |
| How many customers tried to access | - |
Graphs
Affected Sidekiq Jobs
Queue length after feature was enabled
Apdex for Sidekiq service when feature flag was enabled
Detection & Response
| Question | Answer |
|---|---|
| When was the incident detected? | 2022-12-14 10:20 UTC |
| How was the incident detected? | Alertmanager alert |
| Did alarming work as expected? | Yes |
| How long did it take from the start of the incident to its detection? | ~14 minutes (Apdex started dropping at 10:06 UTC; the alert fired at 10:20 UTC) |
| How long did it take from detection to remediation? | Feature flag was disabled at 10:29 UTC (9 minutes) |
| What steps were taken to remediate? | Created a Slack channel and Zoom call for the incident, disabled the feature flag, and observed charts for improvement after the flag was disabled |
| Were there any issues with the response? | No |
MR Checklist
| Question | Answer |
|---|---|
| Was the MR acceptance checklist marked as reviewed in the MR? | Yes |
| Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | Yes, the feature flag rollout checklist should be updated to reduce the chance of future recurrences (see Corrective actions) |
Timeline
2022-11-30
- 21:18 UTC - !103283 (merged), which introduced the feature flag and the code change, was merged
2022-12-01
- 10:45 UTC - !103283 (merged) was deployed to production
2022-12-05
- 12:39 UTC - `multi_pipeline_scan_result_policies` feature flag was enabled on the dev/staging/staging-ref environments; no impact on Apdex was observed
2022-12-12
- 13:19 UTC - `multi_pipeline_scan_result_policies` feature was enabled on the `gitlab-org/gitlab`, `gitlab-org/gitlab-foss`, `gitlab-com/www-gitlab-com`, and `gitlab-org/govern/demos/sandbox/issue-379108-verification` projects,
- 13:29 UTC - `multi_pipeline_scan_result_policies` feature was disabled on the `gitlab-org/gitlab`, `gitlab-org/gitlab-foss`, `gitlab-com/www-gitlab-com`, and `gitlab-org/govern/demos/sandbox/issue-379108-verification` projects as it was not working as expected (no timeouts or any other impact on Apdex were discovered),
2022-12-13
- 07:07 UTC - a Teleport request was created to investigate why the feature was not working as expected (https://gitlab.slack.com/archives/CB3LSMEJV/p1670915267750039); the root cause was identified (timeout in the query to fetch pipelines with a given `sha`),
- 09:12 UTC - a new MR fixing the identified problem was created (!106793 (merged)),
- 17:02 UTC - !106793 (merged) was merged,
2022-12-14
- 01:02 UTC - !106793 (merged) was deployed to the production environment,
- 09:14 UTC - `multi_pipeline_scan_result_policies` feature was enabled on the `gitlab-org/gitlab`, `gitlab-org/gitlab-foss`, `gitlab-com/www-gitlab-com`, and `gitlab-org/govern/demos/sandbox/issue-379108-verification` projects,
- 09:16 UTC - gitlab-org/govern/demos/sandbox/issue-379108-verification!5 (closed) and gitlab-org/govern/demos/sandbox/issue-379108-verification!6 (closed) MRs were created to confirm that the feature was working as expected,
- 09:28 UTC - `multi_pipeline_scan_result_policies` feature flag was enabled for 25% of actors,
- 09:37 UTC - `multi_pipeline_scan_result_policies` feature flag was enabled for 50% of actors,
- 09:41 UTC - `multi_pipeline_scan_result_policies` feature flag was enabled for 75% of actors,
- 09:45 UTC - `multi_pipeline_scan_result_policies` feature flag was enabled globally on GitLab.com,
- 10:06 UTC - Apdex for the Sidekiq service started dropping,
- 10:18 UTC - Apdex dropped to ~92%,
- 10:20 UTC - alert from Alertmanager on the #production channel (https://gitlab.slack.com/archives/C101F3796/p1671013208117429),
- 10:24 UTC - the incident was declared and a Zoom call started to investigate the cause,
- 10:29 UTC - `multi_pipeline_scan_result_policies` feature flag was disabled globally on GitLab.com,
- 10:41 UTC - status page was updated with Degraded Performance information with Investigating status (https://status.gitlab.com/pages/history/5b36dc6502d06804c08349f7),
- 10:51 UTC - Apdex for the Sidekiq service recovered,
- 10:51 UTC - status page was updated with Degraded Performance information with Monitoring status (https://status.gitlab.com/pages/history/5b36dc6502d06804c08349f7)
- 11:23 UTC - status page was updated with Degraded Performance information with Resolved status (https://status.gitlab.com/pages/history/5b36dc6502d06804c08349f7)
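The incremental 25% / 50% / 75% / global rollout in the timeline above relies on percentage-of-actors gating (GitLab's feature flags are backed by Flipper). As a minimal illustrative sketch, not GitLab's actual implementation, a stable hash of the actor places it in a fixed bucket, so raising the percentage only ever adds actors and never toggles an already-enabled one off (`enabled_for?` is a hypothetical helper name):

```ruby
require "zlib"

# Sketch of percentage-of-actors gating, Flipper-style.
# Each actor lands in a deterministic bucket 0..99 derived from the
# feature name and actor id, so rollout steps are monotonic:
# an actor enabled at 25% stays enabled at 50%, 75%, and 100%.
def enabled_for?(feature_name, actor_id, percentage)
  bucket = Zlib.crc32("#{feature_name}:#{actor_id}") % 100
  bucket < percentage
end
```

This determinism is why observing the system between steps is meaningful: each step exposes a strictly larger, stable population of actors to the new code path.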
Root Cause Analysis
What went well
- the new feature was introduced behind a feature flag, which allowed it to be disabled quickly,
- the monitoring system worked as expected, declaring a new incident,
- the SRE and Support teams quickly identified the cause of the degraded performance,
What can be improved
- we should encourage engineers to wait at least 15 minutes before enabling a feature flag for more actors:
  - during `Production Checkin` ChatOps we could add an additional check that prevents moving forward with enabling a feature flag for more actors when requested within 15 minutes of the last update,
  - we could add a note to the feature flag rollout template that you should wait at least 15 minutes,
- we should encourage engineers to observe https://dashboards.gitlab.net after a feature flag is enabled:
  - after a feature flag is enabled, ChatOps could respond on the #production channel with a link to https://dashboards.gitlab.net and the handbook page on how to observe metrics after enabling a feature flag,
  - we could add a link to https://dashboards.gitlab.net to the feature flag rollout template, plus an additional checkbox to tick after enabling the flag and an explanation of which services to observe,
- we should help engineers find good resources on how to read metrics from https://dashboards.gitlab.net and look for metrics related to the introduced change,
- as the issue was not reproducible on the staging environment, we might look for ways to increase traffic there so we can properly evaluate how a given change affects the environment,
- we should investigate the potential performance impact during development/testing of the MR,
Corrective actions
Proposed actions:
- https://gitlab.com/gitlab-com/chatops:
  - update the production check to verify that the feature was not enabled for more actors within the last 15 minutes,
  - update the ChatOps bot response to include a link to https://dashboards.gitlab.net/ and a link to the handbook page about which dashboards to observe,
- https://gitlab.com/gitlab-org/gitlab:
  - update the feature flag rollout template with a link to https://dashboards.gitlab.net, an additional checkbox to tick after enabling the feature flag, and an explanation of which services to observe,
  - update the feature flag rollout template with the information that you should wait at least 15 minutes after each step when incrementally enabling a feature flag,
  - add more traffic to the staging environment (e.g. generate synthetic traffic or replicate traffic from the `gitlab-com` and `gitlab-org` groups),
- https://gitlab.com/gitlab-com/www-gitlab-com:
  - add a handbook page explaining how to read metrics from https://dashboards.gitlab.net after enabling a feature flag or deploying a feature; collaborate with the SRE team to record a video about it, and additionally we could create a LevelUp course and encourage teams to attend,
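The proposed ChatOps cool-down check could look roughly like the sketch below. `RolloutGuard` and its method names are illustrative, not part of the real chatops project; the point is that the guard is a pure function of the last change time, so it is trivially testable and can be injected into the production check:

```ruby
# Hypothetical sketch of the proposed 15-minute cool-down guard for
# incremental feature flag rollouts.
class RolloutGuard
  COOL_DOWN = 15 * 60 # seconds to wait between rollout steps

  # last_change_at: Time of the most recent percentage/actor change,
  # or nil if the flag has never been changed. `now` is injectable
  # so the check is deterministic in tests.
  def self.allowed?(last_change_at, now: Time.now)
    last_change_at.nil? || (now - last_change_at) >= COOL_DOWN
  end

  def self.check!(last_change_at, now: Time.now)
    return "ok to proceed" if allowed?(last_change_at, now: now)

    remaining = (COOL_DOWN - (now - last_change_at)).ceil
    "please wait #{remaining}s before the next rollout step"
  end
end
```

In the real production check, `last_change_at` would come from the feature flag's change log rather than being passed in directly.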
Planned actions:
As we have created an FCL as a result of this RCA, and to keep the FCL as the single source of truth for our work, the corrective actions are tracked here: https://gitlab.com/gitlab-com/feature-change-locks/-/issues/34#work-plan.


