RCA - Sidekiq shard_urgent_cpu_bound saturated after enabling multi_pipeline_scan_result_policies feature flag
## Summary
Related incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8159+
After enabling the `multi_pipeline_scan_result_policies` feature flag, we noticed delays in background processing for approximately 45 minutes that impacted both Web and API. The primary impact was slower processing of CI jobs, issues, merge requests, and notes.
The identified root cause was the feature-flag rollout: https://gitlab.com/gitlab-org/gitlab/-/issues/382990+.
Service(s) affected: Web, API, Sidekiq
Team attribution: gitlab-com/gl-infra/reliability~12612686
Minutes of downtime or degradation: `2022-12-14` `10:00 UTC` - `10:45 UTC` (45 min)
## Impact & Metrics
| Question | Answer |
| ----- | ----- |
| What was the impact? | Delays in background processing that impacted both Web and API |
| Who was impacted? | External and internal customers |
| How did this impact customers? | Slower processing of CI jobs, issues, merge requests, and notes |
| How many attempts were made to access? | - |
| How many customers were affected? | - |
| How many customers tried to access? | - |
### Graphs
#### Affected Sidekiq Jobs

#### Queue length after feature was enabled

#### Apdex for Sidekiq service when feature flag was enabled

## Detection & Response
| Question | Answer |
| ----- | ----- |
| When was the incident detected? | 2022-12-14 10:20 UTC |
| How was the incident detected? | Alertmanager alert |
| Did alarming work as expected? | Yes |
| How long did it take from the start of the incident to its detection? | Apdex started dropping from 10:06 UTC |
| How long did it take from detection to remediation? | Feature flag was disabled at 10:29 UTC (9 minutes) |
| What steps were taken to remediate? | Created a Slack channel and Zoom call for the incident, disabled the feature flag, and observed charts for improvement after the flag was disabled |
| Were there any issues with the response? | No |
## MR Checklist
Consider these questions if a code change introduced the issue.
| Question | Answer |
| ----- | ----- |
| Was the [MR acceptance checklist](https://docs.gitlab.com/ee/development/code_review.html#acceptance-checklist) marked as reviewed in the MR? | Yes |
| Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | The MR acceptance checklist does not need updating, but the feature flag rollout checklist should be updated to reduce the chances of future recurrences |
## Timeline
2022-11-30
- 21:18 UTC - https://gitlab.com/gitlab-org/gitlab/-/merge_requests/103283, which introduced the feature flag and the code change, was merged
2022-12-01
- 10:45 UTC - https://gitlab.com/gitlab-org/gitlab/-/merge_requests/103283 was deployed to production
2022-12-05
- 12:39 UTC - the `multi_pipeline_scan_result_policies` feature flag was enabled on the dev, staging, and staging-ref environments; no impact on Apdex was observed
2022-12-12
- 13:19 UTC - the `multi_pipeline_scan_result_policies` feature flag was enabled for the `gitlab-org/gitlab`, `gitlab-org/gitlab-foss`, `gitlab-com/www-gitlab-com`, and `gitlab-org/govern/demos/sandbox/issue-379108-verification` projects
- 13:29 UTC - the feature flag was disabled for the same projects because the feature was not working as expected (no timeouts or any other impact on Apdex were observed)
2022-12-13
- 07:07 UTC - a Teleport request was created to investigate why the feature was not working as expected (https://gitlab.slack.com/archives/CB3LSMEJV/p1670915267750039); the root cause was identified as a timeout in the query fetching pipelines with a given `sha`
- 09:12 UTC - a new MR fixing the identified problem was created (https://gitlab.com/gitlab-org/gitlab/-/merge_requests/106793)
- 17:02 UTC - https://gitlab.com/gitlab-org/gitlab/-/merge_requests/106793 was merged
2022-12-14
- 01:02 UTC - https://gitlab.com/gitlab-org/gitlab/-/merge_requests/106793 was deployed to the production environment
- 09:14 UTC - the `multi_pipeline_scan_result_policies` feature flag was enabled for the `gitlab-org/gitlab`, `gitlab-org/gitlab-foss`, `gitlab-com/www-gitlab-com`, and `gitlab-org/govern/demos/sandbox/issue-379108-verification` projects
- 09:16 UTC - https://gitlab.com/gitlab-org/govern/demos/sandbox/issue-379108-verification/-/merge_requests/5 and https://gitlab.com/gitlab-org/govern/demos/sandbox/issue-379108-verification/-/merge_requests/6 were created to confirm that the feature was working as expected
- 09:28 UTC - the `multi_pipeline_scan_result_policies` feature flag was enabled for 25% of actors
- 09:37 UTC - the `multi_pipeline_scan_result_policies` feature flag was enabled for 50% of actors
- 09:41 UTC - the `multi_pipeline_scan_result_policies` feature flag was enabled for 75% of actors
- 09:45 UTC - the `multi_pipeline_scan_result_policies` feature flag was enabled globally on GitLab.com
- 10:06 UTC - Apdex for the Sidekiq service started dropping
- 10:18 UTC - Apdex dropped to ~92%
- 10:20 UTC - an alert from Alertmanager was posted on the #production channel (https://gitlab.slack.com/archives/C101F3796/p1671013208117429)
- 10:24 UTC - the incident was declared and a Zoom call was started to investigate the cause
- 10:29 UTC - the `multi_pipeline_scan_result_policies` feature flag was disabled globally on GitLab.com
- 10:41 UTC - the status page was updated to Degraded Performance with an Investigating status (https://status.gitlab.com/pages/history/5b36dc6502d06804c08349f7)
- 10:51 UTC - Apdex for Sidekiq recovered
- 10:51 UTC - the status page status was changed to Monitoring (https://status.gitlab.com/pages/history/5b36dc6502d06804c08349f7)
- 11:23 UTC - the status page status was changed to Resolved (https://status.gitlab.com/pages/history/5b36dc6502d06804c08349f7)
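The incremental percentage-of-actors rollout above (25% → 50% → 75% → global) relies on consistent hashing, so the set of enabled actors only grows as the percentage increases. A simplified, illustrative sketch of how such gating typically works (the function name and hashing details here are an assumption, not GitLab's exact implementation):

```ruby
require 'zlib'

# Illustrative sketch of percentage-of-actors feature-flag gating.
# An actor is enabled when a stable hash of the flag name plus the
# actor id falls below the configured percentage, so the same actors
# remain enabled as the rollout widens from 25% toward 100%.
def actor_enabled?(flag_name, actor_id, percentage)
  Zlib.crc32("#{flag_name}#{actor_id}") % 100 < percentage
end
```

Because the hash is stable per actor, raising the percentage never disables anyone already enabled; disabling the flag globally (as done at 10:29 UTC) is the only immediate way to roll everyone back.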
## Root Cause Analysis
The purpose of this document is to understand the reasons that caused the incident, and to create mechanisms to prevent it from recurring. A root cause can **never be a person**; the analysis must refer to the system and the context rather than to specific actors.
Follow the "**5 whys**" in a **blameless** manner as the core of the root cause analysis.
Start with the incident and ask why it happened, then keep asking "why?" up to 5 times. While it is not a hard rule that it must be exactly 5 times, it helps the questions dig deeper toward the actual root cause.
Keep in mind that one "why?" may have more than one answer; consider following the different branches.
### Example of the usage of "5 whys"
The vehicle will not start. (the problem)
1. Why? - The battery is dead.
2. Why? - The alternator is not functioning.
3. Why? - The alternator belt has broken.
4. Why? - The alternator belt was well beyond its useful service life and not replaced.
5. Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)
## What went well
* the new feature was introduced behind a feature flag, which allowed it to be disabled quickly,
* the monitoring system worked as expected and a new incident was declared promptly,
* the SRE and Support teams quickly identified the cause of the degraded performance.
## What can be improved
* we should encourage engineers to wait at least 15 minutes before enabling a feature flag for more actors:
  * during the `Production Check` in ChatOps, we could add an additional check that prevents enabling a feature flag for more actors when requested within 15 minutes of the last update,
  * we could update the [feature flag rollout template](https://gitlab.com/gitlab-org/gitlab/-/blob/master/.gitlab/issue_templates/Feature%20Flag%20Roll%20Out.md) to state that you should wait at least 15 minutes between steps,
* we should encourage engineers to observe https://dashboards.gitlab.net after a feature flag is enabled:
  * after a feature flag is enabled, ChatOps could respond on the #production channel with a link to https://dashboards.gitlab.net and the handbook page on how to observe metrics after enabling a feature flag,
  * we could update the [feature flag rollout template](https://gitlab.com/gitlab-org/gitlab/-/blob/master/.gitlab/issue_templates/Feature%20Flag%20Roll%20Out.md) with a link to https://dashboards.gitlab.net, an additional checkbox to tick after enabling the feature flag, and an explanation of which services to observe,
* we should help engineers find good resources on how to read metrics from https://dashboards.gitlab.net and look for metrics related to the introduced change,
* as the issue was not reproducible on the staging environment, we could look for ways to increase traffic there so we can properly evaluate how a given change affects the environment,
* we should investigate the potential performance impact during development and testing of the MR.
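The proposed 15-minute guard could be sketched as follows; `WAIT_INTERVAL` and `rollout_allowed?` are hypothetical names for illustration, not existing ChatOps code:

```ruby
require 'time'

# Hypothetical sketch of the proposed ChatOps production check: refuse
# to widen a feature-flag rollout when the flag was last changed less
# than 15 minutes ago.
WAIT_INTERVAL = 15 * 60 # seconds

def rollout_allowed?(last_changed_at, now = Time.now)
  (now - last_changed_at) >= WAIT_INTERVAL
end

# 09:41 -> 09:45 was only 4 minutes, so a check like this would have
# blocked the jump from 75% to 100%:
rollout_allowed?(Time.parse('2022-12-14 09:41:00 UTC'),
                 Time.parse('2022-12-14 09:45:00 UTC')) # => false
```

In this incident, each rollout step was 4 to 9 minutes apart, which left too little time for the Sidekiq queue metrics to reflect the previous step before the next one was taken.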
## Corrective actions
### Proposed actions:
* https://gitlab.com/gitlab-com/chatops:
  * update the production check to verify that the feature flag was not enabled for more actors within the last 15 minutes,
  * update the ChatOps bot response to include a link to https://dashboards.gitlab.net/ and a link to the handbook page about which dashboards to observe,
* https://gitlab.com/gitlab-org/gitlab:
  * update the [feature flag rollout template](https://gitlab.com/gitlab-org/gitlab/-/blob/master/.gitlab/issue_templates/Feature%20Flag%20Roll%20Out.md) with a link to https://dashboards.gitlab.net, an additional checkbox to tick after enabling the feature flag, and an explanation of which services to observe,
  * update the [feature flag rollout template](https://gitlab.com/gitlab-org/gitlab/-/blob/master/.gitlab/issue_templates/Feature%20Flag%20Roll%20Out.md) to state that you should wait at least 15 minutes after each step when incrementally enabling a feature flag,
  * add more traffic to the staging environment (e.g. generate synthetic traffic or replicate traffic from the `gitlab-com` and `gitlab-org` groups),
* https://gitlab.com/gitlab-com/www-gitlab-com:
  * update the handbook with an explanation of how to read metrics from https://dashboards.gitlab.net after enabling a feature flag or deploying a feature; collaborate with the SRE team to record a video about it, and additionally create a LevelUp course and encourage teams to attend.
### Planned actions:
As we have created an [FCL](https://gitlab.com/gitlab-com/feature-change-locks/-/issues/34) as a result of this RCA, and to keep the FCL as the single source of truth for our work, the corrective actions are tracked here: https://gitlab.com/gitlab-com/feature-change-locks/-/issues/34#work-plan.
## Guidelines
- [Blameless RCA Guideline](https://about.gitlab.com/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis.html)
- [5 whys](https://en.wikipedia.org/wiki/5_Whys)