RCA - Sidekiq shard_urgent_cpu_bound saturated after enabling multi_pipeline_scan_result_policies feature flag

Please note: if the incident relates to sensitive data or is security-related, consider labeling this issue with security and marking it confidential.


Summary

Related incident: 2022-12-14: Sidekiq shard_urgent_cpu_bound satu... (production#8159 - closed)

After the multi_pipeline_scan_result_policies feature flag was enabled, background processing was delayed for approximately 45 minutes, which affected both Web and API. The primary impact was slower processing of CI jobs, issues, merge requests, and notes.

The identified root cause was the feature flag rollout: [Feature flag] Rollout of `multi_pipeline_scan_... (gitlab-org/gitlab#382990 - closed).

Service(s) affected: Web, API, Sidekiq
Team attribution: group::security policies
Minutes downtime or degradation: 2022-12-14 10:00 UTC - 10:45 UTC (45 min)

Impact & Metrics

Start with the following:

Question | Answer
What was the impact | Delays in background processing that impacted both Web and API
Who was impacted | External and internal customers
How did this impact customers | Slower processing of CI jobs, issues, merge requests, and notes
How many attempts made to access | -
How many customers affected | -
How many customers tried to access | -

Graphs

Affected Sidekiq Jobs

image1

Queue length after feature was enabled

image2

Apdex for Sidekiq service when feature flag was enabled

image3

Detection & Response

Start with the following:

Question | Answer
When was the incident detected? | 2022-12-14 10:20 UTC
How was the incident detected? | Alertmanager alert
Did alarming work as expected? | Yes
How long did it take from the start of the incident to its detection? | 14 minutes (Apdex started dropping at 10:06 UTC; the incident was detected at 10:20 UTC)
How long did it take from detection to remediation? | 9 minutes (the feature flag was disabled at 10:29 UTC)
What steps were taken to remediate? | Created a Slack channel and Zoom call for the incident, disabled the feature flag, and observed the charts for improvement after the flag was disabled
Were there any issues with the response? | No
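
The intervals in the table above can be cross-checked with a short calculation. This is a minimal sketch in Python that uses only the timestamps quoted in this report; the helper and variable names are illustrative and not part of any GitLab tooling.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM time on the incident day (2022-12-14) as UTC."""
    parsed = datetime.strptime(f"2022-12-14 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Timestamps as reported in this document.
apdex_drop = utc("10:06")         # Apdex started dropping
detected = utc("10:20")           # incident detected (Alertmanager alert)
flag_disabled = utc("10:29")      # feature flag disabled
degradation_start = utc("10:00")  # start of the reported degradation window
degradation_end = utc("10:45")    # end of the reported degradation window

time_to_detect = minutes_between(apdex_drop, detected)
time_to_remediate = minutes_between(detected, flag_disabled)
total_degradation = minutes_between(degradation_start, degradation_end)

print(time_to_detect, time_to_remediate, total_degradation)  # prints: 14 9 45
```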

MR Checklist

Consider these questions if a code change introduced the issue.

Question | Answer
Was the MR acceptance checklist marked as reviewed in the MR? | Yes
Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | No. The feature flag rollout checklist, rather than the MR acceptance checklist, should be updated to reduce the chance of recurrence.

Timeline

2022-11-30

2022-12-01

2022-12-05

  • 12:39 UTC - the multi_pipeline_scan_result_policies feature flag was enabled on the dev, staging, and staging-ref environments; no impact on Apdex was observed.

2022-12-12

  • 13:19 UTC - the multi_pipeline_scan_result_policies feature flag was enabled on the gitlab-org/gitlab, gitlab-org/gitlab-foss, gitlab-com/www-gitlab-com, and gitlab-org/govern/demos/sandbox/issue-379108-verification projects.
  • 13:29 UTC - the multi_pipeline_scan_result_policies feature flag was disabled on the same four projects because it was not working as expected (no timeouts or other impact on Apdex were observed).

2022-12-13

2022-12-14

Root Cause Analysis

The purpose of this document is to understand what caused an incident and to create mechanisms that prevent it from recurring. A root cause can never be a person; the writing must refer to the system and the context rather than to specific actors.

Follow the "5 whys" in a blameless manner as the core of the root-cause analysis.

For this, start with the incident and ask why it happened. Keep iterating, asking "why?" 5 times. It is not a hard rule that it has to be 5 times, but it helps the questions dig deeper toward the actual root cause.

Keep in mind that one "why?" may have more than one answer; consider following the different branches.

Example of the usage of "5 whys"

The vehicle will not start. (the problem)

  1. Why? - The battery is dead.
  2. Why? - The alternator is not functioning.
  3. Why? - The alternator belt has broken.
  4. Why? - The alternator belt was well beyond its useful service life and not replaced.
  5. Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)

What went well

  • The new feature was introduced behind a feature flag, which made it possible to disable the feature quickly.
  • The monitoring system worked as intended and declared a new incident.
  • The SRE and Support teams quickly identified the cause of the performance drop.

What can be improved

  • We should encourage engineers to wait at least 15 minutes before enabling a feature flag for more actors:
    • During Production Checkin Chatops, we could add a check that blocks enabling a feature flag for more actors when the request comes within 15 minutes of the last update.
    • We could add a note to the feature flag rollout template stating that engineers should wait at least 15 minutes.
  • We should encourage engineers to observe https://dashboards.gitlab.net after a feature flag is enabled.
  • We should help engineers find good resources on how to read metrics on https://dashboards.gitlab.net and how to find the metrics related to the change being introduced.
  • Because the issue was not reproducible on the staging environment, we might look for ways to increase traffic there so that we can properly evaluate how a given change affects the environment.
  • We should investigate the potential performance impact during development and testing of the MR.
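
The 15-minute wait proposed in the first item could be enforced by a check along these lines. This is a minimal sketch in Python, not the actual ChatOps implementation: `can_widen_rollout`, `COOLDOWN`, and the example timestamps are all hypothetical, and how the last-update time of a feature flag would be obtained is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Assumed cooldown, taken from the 15-minute wait proposed above.
COOLDOWN = timedelta(minutes=15)


def can_widen_rollout(
    last_update: datetime,
    now: datetime,
    cooldown: timedelta = COOLDOWN,
) -> Tuple[bool, timedelta]:
    """Decide whether a flag may be enabled for more actors.

    Returns (allowed, remaining_wait). The request is allowed only when
    at least `cooldown` has passed since the flag was last updated.
    """
    elapsed = now - last_update
    if elapsed >= cooldown:
        return True, timedelta(0)
    return False, cooldown - elapsed


# Hypothetical example: a flag last updated at 12:00 UTC.
last_update = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)

# A request 5 minutes later is rejected, with 10 minutes still to wait.
early = can_widen_rollout(last_update, last_update + timedelta(minutes=5))

# A request 20 minutes later is allowed.
later = can_widen_rollout(last_update, last_update + timedelta(minutes=20))

print(early[0], early[1])  # prints: False 0:10:00
print(later[0], later[1])  # prints: True 0:00:00
```

Returning the remaining wait alongside the decision lets the caller tell the engineer how long to keep watching the dashboards before retrying, rather than only refusing the request.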

Corrective actions

Proposed actions:

Planned actions:

Because an FCL was created as a result of this RCA, and to keep the FCL as the single source of truth for our work, the work needed for the corrective actions is tracked here: https://gitlab.com/gitlab-com/feature-change-locks/-/issues/34#work-plan.

Guidelines

Edited by Alan (Maciej) Paruszewski