2020-11-11: web fleet on staging is under high load (bug uncovered)
Summary
It seems some admin settings in Staging (merge_requests_committers_approval and others) were enabled by an MR, while the same settings remained disabled in production/canary. Today's deploy to Staging then uncovered an undesired behaviour triggered by these settings: unexpectedly high repository traffic coming through the Staging web fleet.
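The drift itself could have been caught by comparing instance-level settings between environments. A minimal sketch, assuming admin API tokens for both instances and using the GitLab Application Settings API; the URLs and token values are placeholders:

```python
# Hypothetical drift check: diff /api/v4/application/settings between
# production and staging. URLs and tokens below are placeholders.
import requests

INSTANCES = {
    "production": ("https://gitlab.example.com", "PROD_ADMIN_TOKEN"),
    "staging": ("https://staging.gitlab.example.com", "STG_ADMIN_TOKEN"),
}

def fetch_settings(base_url, token):
    """Return the instance-level application settings as a dict."""
    resp = requests.get(
        f"{base_url}/api/v4/application/settings",
        headers={"PRIVATE-TOKEN": token},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

prod = fetch_settings(*INSTANCES["production"])
stg = fetch_settings(*INSTANCES["staging"])

# Report every setting whose value differs between the two environments,
# which would have surfaced the settings that diverged here (assuming they
# are exposed through this API).
for key in sorted(set(prod) | set(stg)):
    if prod.get(key) != stg.get(key):
        print(f"{key}: production={prod.get(key)!r} staging={stg.get(key)!r}")
```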
Timeline
All times UTC.
2020-11-11
- 05:11 - Release 13.6.202011110320-00ebce775ae.0c153ab9b53 finished deploying on gstg
- 05:35 - Latency apdex on the gstg web fleet starts dropping
- 09:44 - EOC declares incident in Slack.
- 10:10 - We observe a high volume of Gitaly requests originating from a single endpoint
- 10:16 - We disable this endpoint in our traffic generator script
- 10:30 - Latency apdex starts recovering
- 12:18 - We search for a change that could've introduced this regression
- 12:32 - We find a potential MR (gitlab-org/gitlab!46637 (merged))
- 13:38 - We decide to revert the MR and deploy a new release to staging
- 15:47 - We mirror the production application configuration to staging (#3010 (comment 445672932))
- 15:52 - We re-enable the previously disabled endpoint; latency apdex remains healthy
- 16:14 - Staging deployment finishes; latency apdex remains healthy
- 16:38 - We reach a conclusion on what happened over the past hours (see summary)
Corrective Actions
- Excessive calls to Gitaly when certain compliance settings enabled
- Document a workflow for handling feature changes that require an accompanying project settings modification
Incident Review
Summary
It seems some admin settings in Staging (merge_requests_committers_approval and others) were enabled by an MR, while the same settings remained disabled in production/canary. Today's deploy to Staging then uncovered an undesired behaviour triggered by these settings: unexpectedly high repository traffic coming through the Staging web fleet.
- Service(s) affected: ServiceWeb
- Team attribution: groupcompliance
- Minutes downtime or degradation: 617 minutes (10 hours and 17 minutes)
Metrics
CPU utilization on web nodes:
CPU load on Gitaly nodes:
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- Internal customers, the Delivery team
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- The web frontend was returning 50x errors or responding very slowly
- How many customers were affected?
- The entire Delivery team was impacted, as the incident prevented them from deploying new releases.
- If a precise customer impact number is unknown, what is the estimated potential impact?
- N/A
Incident Response Analysis
- How was the event detected?
- An alert was fired
- How could detection time be improved?
- N/A
- How did we reach the point where we knew how to mitigate the impact?
- By looking for anomalies in the logs, we found the offending endpoint and who was hitting it: an automated traffic generator script run by us. We updated the script to skip this endpoint for the time being, until we could investigate the root cause (a sketch of this kind of change follows this list).
- How could time to mitigation be improved?
- N/A
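The mitigation amounted to taking one endpoint out of the traffic generator's rotation. The sketch below is hypothetical: the endpoint paths, project name, and script structure are illustrative stand-ins, not the actual generator script.

```python
# Illustrative traffic-generator loop with a deny-list for the offending
# endpoint. Paths, project name, and cadence are made up for this sketch.
import time
import requests

BASE_URL = "https://staging.gitlab.example.com"
ENDPOINTS = [
    "/api/v4/projects",
    "/example-group/example-project/-/merge_requests",  # MRs index page
    "/example-group/example-project/-/issues",
]

# Mitigation: temporarily skip the endpoint that triggers excessive Gitaly calls.
DISABLED_ENDPOINTS = {"/example-group/example-project/-/merge_requests"}

def run_once():
    for path in ENDPOINTS:
        if path in DISABLED_ENDPOINTS:
            continue  # do not exercise this endpoint until the root cause is fixed
        requests.get(BASE_URL + path, timeout=30)

if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(60)  # one pass per minute
```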
Post Incident Analysis
- How was the root cause diagnosed?
- By looking at the Gitaly calls made by the offending endpoint, we found a call stack originating from approvals handling. We then reviewed the list of changes introduced in the latest staging deployment for anything related to approvals, found one candidate MR, and decided to revert it to see whether that fixed the problem (a log-analysis sketch follows this list).
- How could time to diagnosis be improved?
- TBD.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- N/A
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
- Yes, the latest staging deployment included gitlab-org/gitlab!46637 (merged), which enabled the settings involved.
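A sketch of the kind of log analysis that points at the offending endpoint: aggregate per-request Gitaly call counts by path from the structured Rails logs. The field names (`path`, `gitaly_calls`) are assumptions about the JSON log schema; adjust them if they differ.

```python
# Group per-request Gitaly call counts by endpoint from structured JSON
# Rails logs (read from stdin). Field names are assumptions about the schema.
import json
import sys
from collections import Counter

calls_by_path = Counter()

for line in sys.stdin:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip lines that are not JSON
    path = event.get("path")
    gitaly_calls = event.get("gitaly_calls", 0)
    if path and gitaly_calls:
        calls_by_path[path] += gitaly_calls

# The endpoints responsible for the most Gitaly traffic float to the top.
for path, total in calls_by_path.most_common(10):
    print(f"{total:>10}  {path}")
```

Run as `python gitaly_hotspots.py < production_json.log` (script name illustrative); in practice the same grouping can be done directly in Kibana.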
5 Whys
- Staging was down, why?
- The web fleet was under high CPU usage and requests couldn't finish in time.
- Why was the web fleet under high CPU usage?
- One particular request (the MRs index) sent by the staging traffic generator started consuming a lot of resources to complete.
- Why did this request start consuming so many resources?
- A recent deployment introduced a configuration change that caused the request to behave this way.
- Why did the configuration change have this effect?
- The configuration change triggered a code path that ended up calling Gitaly for each MR on the MRs index page to evaluate MR approvers.
- Why would calling Gitaly for each MR have such a bad effect on staging?
- Ideally it wouldn't, but one MR on that MRs index page happened to have 13K commits, and that is what caused the high resource usage (a back-of-the-envelope sketch follows).
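A back-of-the-envelope sketch of that amplification, assuming committer evaluation has to inspect each commit of every MR on the page; the page size and per-lookup cost are invented numbers, only the 13K commit count comes from the incident.

```python
# Rough illustration of why one 13K-commit MR on the MRs index hurts when
# committer-based approval rules are evaluated per MR. All numbers except
# the 13K outlier are invented for illustration.
COMMITS_PER_MR = [30] * 19 + [13_000]   # 20 MRs on the page, one outlier
LOOKUP_COST_MS = 2                      # assumed cost of one Gitaly-backed lookup

per_mr_cost_ms = [n * LOOKUP_COST_MS for n in COMMITS_PER_MR]
print(f"MRs on the page:         {len(COMMITS_PER_MR)}")
print(f"commits inspected:       {sum(COMMITS_PER_MR)}")
print(f"estimated request cost:  {sum(per_mr_cost_ms) / 1000:.1f}s "
      f"(outlier MR alone: {max(per_mr_cost_ms) / 1000:.1f}s)")
```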



