2020-11-11: web fleet on staging is under high load (bug uncovered)
Summary
It seems some admin settings in Staging (merge_requests_committers_approval and others) were enabled by an MR, while the same settings remained disabled in production/canary. Today's deploy to Staging then uncovered an undesired behaviour triggered by these settings: unexpectedly high repository traffic coming through the Staging web fleet.
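The drift itself could have been caught by comparing instance-level settings between environments. A minimal sketch, assuming admin API tokens for both instances and using the GitLab Application Settings API; the URLs and token values are placeholders:

```python
# Hypothetical drift check: diff /api/v4/application/settings between
# production and staging. URLs and tokens below are placeholders.
import requests

INSTANCES = {
    "production": ("https://gitlab.example.com", "PROD_ADMIN_TOKEN"),
    "staging": ("https://staging.gitlab.example.com", "STG_ADMIN_TOKEN"),
}

def fetch_settings(base_url, token):
    """Return the instance-level application settings as a dict."""
    resp = requests.get(
        f"{base_url}/api/v4/application/settings",
        headers={"PRIVATE-TOKEN": token},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

prod = fetch_settings(*INSTANCES["production"])
stg = fetch_settings(*INSTANCES["staging"])

# Report every setting whose value differs between the two environments,
# which would have surfaced the settings that diverged here (assuming they
# are exposed through this API).
for key in sorted(set(prod) | set(stg)):
    if prod.get(key) != stg.get(key):
        print(f"{key}: production={prod.get(key)!r} staging={stg.get(key)!r}")
```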
Timeline
All times UTC.
2020-11-11
- 05:11 - Release 13.6.202011110320-00ebce775ae.0c153ab9b53 finished deploying on gstg
- 05:35 - Latency apdex on the gstg web fleet starts dropping
- 09:44 - EOC declares incident in Slack.
- 10:10 - We observe a high volume of Gitaly requests originating from a single endpoint
- 10:16 - We disable this endpoint in our traffic generator script
- 10:30 - Latency apdex starts recovering
- 12:18 - We search for a change that could've introduced this regression
- 12:32 - We find a potential MR (gitlab-org/gitlab!46637 (merged))
- 13:38 - We decide to revert the MR and deploy a new release to staging
- 15:47 - We mirror the production application configuration to staging (#3010 (comment 445672932))
- 15:52 - We re-enable the previously disabled endpoint; latency apdex remains healthy
- 16:14 - Staging deployment finishes; latency apdex remains healthy
- 16:38 - We reach a conclusion on what happened over the past hours (see summary)
Corrective Actions
- Excessive calls to Gitaly when certain compliance settings enabled
- Document a workflow for handling feature changes that require an accompanying project settings modification
Incident Review
Summary
It seems some admin settings in Staging (merge_requests_committers_approval and others) were enabled by an MR, while the same settings remained disabled in production/canary. Today's deploy to Staging then uncovered an undesired behaviour triggered by these settings: unexpectedly high repository traffic coming through the Staging web fleet.
- Service(s) affected: ServiceWeb
- Team attribution: groupcompliance
- Minutes downtime or degradation: 617 minutes (10 hours and 17 minutes)
Metrics
CPU utilization on web nodes:
CPU load on Gitaly nodes:
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- Internal customers, the Delivery team
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- The web frontend was returning 50x errors or responding very slowly
- How many customers were affected?
- The entire Delivery team was impacted, as the incident prevented them from deploying new releases.
- If a precise customer impact number is unknown, what is the estimated potential impact?
- N/A
Incident Response Analysis
- How was the event detected?
- An alert was fired
- How could detection time be improved?
- N/A
- How did we reach the point where we knew how to mitigate the impact?
- By looking for anomalies in the logs, we found the offending endpoint and who was hitting it: an automated traffic generator script run by us. We updated the script to skip this endpoint for the time being, until we could investigate the root cause (a sketch of this kind of change follows this list).
- How could time to mitigation be improved?
- N/A
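The mitigation amounted to taking one endpoint out of the traffic generator's rotation. The sketch below is hypothetical: the endpoint paths, project name, and script structure are illustrative stand-ins, not the actual generator script.

```python
# Illustrative traffic-generator loop with a deny-list for the offending
# endpoint. Paths, project name, and cadence are made up for this sketch.
import time
import requests

BASE_URL = "https://staging.gitlab.example.com"
ENDPOINTS = [
    "/api/v4/projects",
    "/example-group/example-project/-/merge_requests",  # MRs index page
    "/example-group/example-project/-/issues",
]

# Mitigation: temporarily skip the endpoint that triggers excessive Gitaly calls.
DISABLED_ENDPOINTS = {"/example-group/example-project/-/merge_requests"}

def run_once():
    for path in ENDPOINTS:
        if path in DISABLED_ENDPOINTS:
            continue  # do not exercise this endpoint until the root cause is fixed
        requests.get(BASE_URL + path, timeout=30)

if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(60)  # one pass per minute
```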
Post Incident Analysis
- How was the root cause diagnosed?
- By looking at the Gitaly calls made by the offending endpoint, we found a call stack originating from approvals handling. We then reviewed the list of changes introduced in the latest staging deployment for anything related to approvals, found one candidate MR, and decided to revert it to see whether that fixed the problem (a log-analysis sketch follows this list).
- How could time to diagnosis be improved?
- TBD.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- N/A
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
- Yes, the latest staging deployment included gitlab-org/gitlab!46637 (merged), which enabled the settings involved.
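A sketch of the kind of log analysis that points at the offending endpoint: aggregate per-request Gitaly call counts by path from the structured Rails logs. The field names (`path`, `gitaly_calls`) are assumptions about the JSON log schema; adjust them if they differ.

```python
# Group per-request Gitaly call counts by endpoint from structured JSON
# Rails logs (read from stdin). Field names are assumptions about the schema.
import json
import sys
from collections import Counter

calls_by_path = Counter()

for line in sys.stdin:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip lines that are not JSON
    path = event.get("path")
    gitaly_calls = event.get("gitaly_calls", 0)
    if path and gitaly_calls:
        calls_by_path[path] += gitaly_calls

# The endpoints responsible for the most Gitaly traffic float to the top.
for path, total in calls_by_path.most_common(10):
    print(f"{total:>10}  {path}")
```

Run as `python gitaly_hotspots.py < production_json.log` (script name illustrative); in practice the same grouping can be done directly in Kibana.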
5 Whys
- Staging was down, why?
- The web fleet was under high CPU usage and requests couldn't finish in time.
- Why was the web fleet under high CPU usage?
- One particular request (the MRs index) sent by the staging traffic generator started consuming a lot of resources to complete.
- Why did this request start consuming so many resources?
- A recent deployment introduced a configuration change that caused the request to behave this way.
- Why did the configuration change have this effect?
- The configuration change triggered a code path that ended up calling Gitaly for each MR on the MRs index page to evaluate MR approvers.
- Why would calling Gitaly for each MR have such a bad effect on staging?
- Ideally it wouldn't, but one MR on that MRs index page happened to have 13K commits, and that is what caused the high resource usage (a back-of-the-envelope sketch follows).
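A back-of-the-envelope sketch of that amplification, assuming committer evaluation has to inspect each commit of every MR on the page; the page size and per-lookup cost are invented numbers, only the 13K commit count comes from the incident.

```python
# Rough illustration of why one 13K-commit MR on the MRs index hurts when
# committer-based approval rules are evaluated per MR. All numbers except
# the 13K outlier are invented for illustration.
COMMITS_PER_MR = [30] * 19 + [13_000]   # 20 MRs on the page, one outlier
LOOKUP_COST_MS = 2                      # assumed cost of one Gitaly-backed lookup

per_mr_cost_ms = [n * LOOKUP_COST_MS for n in COMMITS_PER_MR]
print(f"MRs on the page:         {len(COMMITS_PER_MR)}")
print(f"commits inspected:       {sum(COMMITS_PER_MR)}")
print(f"estimated request cost:  {sum(per_mr_cost_ms) / 1000:.1f}s "
      f"(outlier MR alone: {max(per_mr_cost_ms) / 1000:.1f}s)")
```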



