RCA - Sidekiq shard_urgent_cpu_bound saturated after enabling multi_pipeline_scan_result_policies feature flag
Summary
Related incident: 2022-12-14: Sidekiq shard_urgent_cpu_bound satu... (gitlab-com/gl-infra/production#8159 - closed)
After enabling the `multi_pipeline_scan_result_policies` feature flag, we noticed delays in background processing for approximately 45 minutes that impacted both Web and API. The primary impact was slower processing of CI jobs, issues, merge requests, and notes.
The identified root cause was the feature-flag rollout: [Feature flag] Rollout of `multi_pipeline_scan_...` (#382990 - closed).
Service(s) affected : Web, API, Sidekiq
Team attribution: gitlab-com/gl-infra/reliability~12612686
Minutes downtime or degradation: 2022-12-14 10:00 UTC - 10:45 UTC (45 min)
Impact & Metrics
| Question | Answer |
|---|---|
| What was the impact | Delays in background processing that impacted both Web and API |
| Who was impacted | External and internal customers |
| How did this impact customers | Slower processing of CI jobs, issues, merge requests, and notes |
| How many attempts made to access | - |
| How many customers affected | - |
| How many customers tried to access | - |
Graphs
Affected Sidekiq Jobs
Queue length after feature was enabled
Apdex for Sidekiq service when feature flag was enabled
Detection & Response
| Question | Answer |
|---|---|
| When was the incident detected? | 2022-12-14 10:20 UTC |
| How was the incident detected? | Alertmanager alert |
| Did alarming work as expected? | Yes |
| How long did it take from the start of the incident to its detection? | ~14 minutes (Apdex started dropping at 10:06 UTC; the alert fired at 10:20 UTC) |
| How long did it take from detection to remediation? | Feature flag was disabled at 10:29 UTC (9 minutes) |
| What steps were taken to remediate? | Created a Slack channel and Zoom call for the incident, disabled the feature flag, and observed charts for improvement after the flag was disabled |
| Were there any issues with the response? | No |
MR Checklist
| Question | Answer |
|---|---|
| Was the MR acceptance checklist marked as reviewed in the MR? | Yes |
| Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | Yes, the feature flag rollout checklist should be updated to reduce the chance of future recurrences (see Corrective actions) |
Timeline
2022-11-30
- 21:18 UTC - !103283 (merged), which introduced the feature flag and the code change, was merged
2022-12-01
- 10:45 UTC - !103283 (merged) was deployed to production
2022-12-05
- 12:39 UTC - `multi_pipeline_scan_result_policies` feature flag was enabled on the dev/staging/staging-ref environments; no impact on Apdex was observed
2022-12-12
- 13:19 UTC - `multi_pipeline_scan_result_policies` feature was enabled on the `gitlab-org/gitlab`, `gitlab-org/gitlab-foss`, `gitlab-com/www-gitlab-com`, and `gitlab-org/govern/demos/sandbox/issue-379108-verification` projects,
- 13:29 UTC - `multi_pipeline_scan_result_policies` feature was disabled on the `gitlab-org/gitlab`, `gitlab-org/gitlab-foss`, `gitlab-com/www-gitlab-com`, and `gitlab-org/govern/demos/sandbox/issue-379108-verification` projects as it was not working as expected (no timeouts or any other impact on Apdex were discovered),
2022-12-13
- 07:07 UTC - a Teleport request was created to investigate why the feature was not working as expected (https://gitlab.slack.com/archives/CB3LSMEJV/p1670915267750039); the root cause was identified (timeout in the query to fetch pipelines with a given `sha`),
- 09:12 UTC - a new MR fixing the identified problem was created (!106793 (merged)),
- 17:02 UTC - !106793 (merged) was merged,
2022-12-14
- 01:02 UTC - !106793 (merged) was deployed to the production environment,
- 09:14 UTC - `multi_pipeline_scan_result_policies` feature was enabled on the `gitlab-org/gitlab`, `gitlab-org/gitlab-foss`, `gitlab-com/www-gitlab-com`, and `gitlab-org/govern/demos/sandbox/issue-379108-verification` projects,
- 09:16 UTC - gitlab-org/govern/demos/sandbox/issue-379108-verification!5 (closed) and gitlab-org/govern/demos/sandbox/issue-379108-verification!6 (closed) MRs were created to confirm that the feature was working as expected,
- 09:28 UTC - `multi_pipeline_scan_result_policies` feature flag was enabled for 25% of actors,
- 09:37 UTC - `multi_pipeline_scan_result_policies` feature flag was enabled for 50% of actors,
- 09:41 UTC - `multi_pipeline_scan_result_policies` feature flag was enabled for 75% of actors,
- 09:45 UTC - `multi_pipeline_scan_result_policies` feature flag was enabled globally on GitLab.com,
- 10:06 UTC - Apdex for the Sidekiq service started dropping,
- 10:18 UTC - Apdex dropped to ~92%,
- 10:20 UTC - alert from Alertmanager on the #production channel (https://gitlab.slack.com/archives/C101F3796/p1671013208117429),
- 10:24 UTC - the incident was declared and a Zoom call started to investigate the cause,
- 10:29 UTC - `multi_pipeline_scan_result_policies` feature flag was disabled globally on GitLab.com,
- 10:41 UTC - status page was updated with Degraded Performance information with Investigating status (https://status.gitlab.com/pages/history/5b36dc6502d06804c08349f7),
- 10:51 UTC - Apdex for the Sidekiq service recovered,
- 10:51 UTC - status page was updated with Degraded Performance information with Monitoring status (https://status.gitlab.com/pages/history/5b36dc6502d06804c08349f7)
- 11:23 UTC - status page was updated with Degraded Performance information with Resolved status (https://status.gitlab.com/pages/history/5b36dc6502d06804c08349f7)
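The incremental 25% / 50% / 75% / global rollout in the timeline above relies on percentage-of-actors gating (GitLab's feature flags are backed by Flipper). As a minimal illustrative sketch, not GitLab's actual implementation, a stable hash of the actor places it in a fixed bucket, so raising the percentage only ever adds actors and never toggles an already-enabled one off (`enabled_for?` is a hypothetical helper name):

```ruby
require "zlib"

# Sketch of percentage-of-actors gating, Flipper-style.
# Each actor lands in a deterministic bucket 0..99 derived from the
# feature name and actor id, so rollout steps are monotonic:
# an actor enabled at 25% stays enabled at 50%, 75%, and 100%.
def enabled_for?(feature_name, actor_id, percentage)
  bucket = Zlib.crc32("#{feature_name}:#{actor_id}") % 100
  bucket < percentage
end
```

This determinism is why observing the system between steps is meaningful: each step exposes a strictly larger, stable population of actors to the new code path.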
Root Cause Analysis
What went well
- the new feature was introduced behind a feature flag, which allowed it to be disabled quickly,
- the monitoring system worked as expected, declaring a new incident,
- the SRE and Support teams quickly identified the cause of the degraded performance,
What can be improved
- we should encourage engineers to wait at least 15 minutes before enabling a feature flag for more actors:
  - during `Production Checkin` ChatOps we could add an additional check that prevents moving forward with enabling a feature flag for more actors when requested within 15 minutes of the last update,
  - we could add a note to the feature flag rollout template that you should wait at least 15 minutes,
- we should encourage engineers to observe https://dashboards.gitlab.net after a feature flag is enabled:
  - after a feature flag is enabled, ChatOps could respond on the #production channel with a link to https://dashboards.gitlab.net and the handbook page on how to observe metrics after enabling a feature flag,
  - we could add a link to https://dashboards.gitlab.net to the feature flag rollout template, plus an additional checkbox to tick after enabling the flag and an explanation of which services to observe,
- we should help engineers find good resources on how to read metrics from https://dashboards.gitlab.net and look for metrics related to the introduced change,
- as the issue was not reproducible on the staging environment, we might look for ways to increase traffic there so we can properly evaluate how a given change affects the environment,
- we should investigate the potential performance impact during development/testing of the MR,
Corrective actions
Proposed actions:
- https://gitlab.com/gitlab-com/chatops:
  - update the production check to verify that the feature was not enabled for more actors within the last 15 minutes,
  - update the ChatOps bot response to include a link to https://dashboards.gitlab.net/ and a link to the handbook page about which dashboards to observe,
- https://gitlab.com/gitlab-org/gitlab:
  - update the feature flag rollout template with a link to https://dashboards.gitlab.net, an additional checkbox to tick after enabling the feature flag, and an explanation of which services to observe,
  - update the feature flag rollout template with the information that you should wait at least 15 minutes after each step when incrementally enabling a feature flag,
  - add more traffic to the staging environment (e.g. generate synthetic traffic or replicate traffic from the `gitlab-com` and `gitlab-org` groups),
- https://gitlab.com/gitlab-com/www-gitlab-com:
  - add a handbook page explaining how to read metrics from https://dashboards.gitlab.net after enabling a feature flag or deploying a feature; collaborate with the SRE team to record a video about it, and additionally we could create a LevelUp course and encourage teams to attend,
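The proposed ChatOps cool-down check could look roughly like the sketch below. `RolloutGuard` and its method names are illustrative, not part of the real chatops project; the point is that the guard is a pure function of the last change time, so it is trivially testable and can be injected into the production check:

```ruby
# Hypothetical sketch of the proposed 15-minute cool-down guard for
# incremental feature flag rollouts.
class RolloutGuard
  COOL_DOWN = 15 * 60 # seconds to wait between rollout steps

  # last_change_at: Time of the most recent percentage/actor change,
  # or nil if the flag has never been changed. `now` is injectable
  # so the check is deterministic in tests.
  def self.allowed?(last_change_at, now: Time.now)
    last_change_at.nil? || (now - last_change_at) >= COOL_DOWN
  end

  def self.check!(last_change_at, now: Time.now)
    return "ok to proceed" if allowed?(last_change_at, now: now)

    remaining = (COOL_DOWN - (now - last_change_at)).ceil
    "please wait #{remaining}s before the next rollout step"
  end
end
```

In the real production check, `last_change_at` would come from the feature flag's change log rather than being passed in directly.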
Planned actions:
As we have created an FCL as a result of this RCA, and to keep the FCL as the single source of truth for our work, the corrective actions are tracked here: https://gitlab.com/gitlab-com/feature-change-locks/-/issues/34#work-plan.


