2020-06-04 Large load spike on API fleet causing response degradation
/label ~incident ~Incident::Active
Summary
Large load spike on API fleet causing response degradation
Root cause was not known at the time of declaration; see the Incident Review below for the full analysis.
Timeline
All times UTC.
2020-06-04
- 03:14 - First alert received: API service apdex score (latency) below SLO (apdex is sketched in the note after this timeline). Further alerts were received later, but all had the same cause.
- 03:17 - cmiskell declares an incident in Slack using the `/incident declare` command.
- 03:30 - Three problem IP addresses from AWS EC2 address ranges were identified as the source of the traffic, hammering `/api/v4/projects` at a high rate.
- 03:31 - IP addresses blocked
- 03:34 - API service apdex returns to normal and alert clears. Other alerts clear over the next 5 minutes.
- 03:39 - All alerts cleared.
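For context on the 03:14 alert: apdex scores request latency against a threshold T, counting requests completing within T as satisfied and those within 4T as tolerating. A minimal sketch of the standard calculation, with an illustrative threshold rather than GitLab's actual SLO value:

```python
def apdex(latencies_s, t=0.5):
    """Standard Apdex: (satisfied + tolerating / 2) / total.
    t=0.5s is illustrative, not GitLab's actual SLO threshold."""
    satisfied = sum(1 for l in latencies_s if l <= t)
    tolerating = sum(1 for l in latencies_s if t < l <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies_s)

# A CPU-saturated fleet pushes many requests past 4T, so the score
# drops toward 0 and the "apdex below SLO" alert fires.
print(apdex([0.1, 0.3, 0.9, 2.5]))  # 0.625
```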
Incident Review
Summary
A small number of IP addresses from AWS, possibly belonging to a legitimate actor but misbehaving for some reason, began hitting a single API endpoint (`/api/v4/projects`) with no parameters at a rate of 700-800 requests per second. Each request returned JSON describing 20 (most recently created?) public projects. This pushed CPU usage on the API servers to close to 100%, degrading all other calls to the API servers.
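For reference, the offending call is reproducible unauthenticated; GitLab's projects API returns 20 results per page by default, ordered by creation date descending, which would explain the 20 "most recently created?" public projects in the responses. A minimal sketch:

```python
import requests

# Unauthenticated request to the endpoint that was being hammered.
# With no parameters the projects API defaults to 20 results per page,
# most recently created first.
resp = requests.get("https://gitlab.com/api/v4/projects", timeout=10)
resp.raise_for_status()
projects = resp.json()
print(len(projects))                       # 20 by default
print(projects[0]["path_with_namespace"])  # newest public project
```

Serving this listing involves visibility checks and ordering work server-side, which is consistent with 700-800 requests per second saturating API CPU.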
It is believed this was erroneous activity from a third party that has otherwise been specifically allowed to bypass CloudFlare filtering via a list of known IP addresses: the 'bypass' activity on their rule went up at the same time as the problem traffic, and when we blocked the 3 IP addresses, the rate of activity on the bypass rule rose further still. However, it appears the 'good' activity was sufficiently moderated and sensible that it did not cause any load issues. I therefore believe the 3 bad IPs were misbehaving nodes; when they started getting immediate errors, I suspect the job dispatch system at the other end routed more jobs to the healthy nodes, where they succeeded. This is speculation based on very few data points.
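The report does not record the exact blocking mechanism. Given that this traffic was already being evaluated by CloudFlare rules, one plausible way to block the three addresses at the edge is an IP Access Rule via CloudFlare's v4 API; the token, zone ID, and addresses below are placeholders, not values from this incident:

```python
import requests

CF_TOKEN = "REDACTED"  # placeholder, not a real credential
ZONE_ID = "REDACTED"   # placeholder zone
BAD_IPS = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]  # documentation addresses

for ip in BAD_IPS:
    resp = requests.post(
        f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}"
        "/firewall/access_rules/rules",
        headers={"Authorization": f"Bearer {CF_TOKEN}"},
        json={
            "mode": "block",
            "configuration": {"target": "ip", "value": ip},
            "notes": "2020-06-04 incident: abusive /api/v4/projects traffic",
        },
        timeout=10,
    )
    resp.raise_for_status()
```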
- Service(s) affected: Service::API, Service::CI Runners
- Team attribution: team::Reliability
- Minutes downtime or degradation: 24 minutes
Metrics
Customer Impact
- Who was impacted by this incident? All users of gitlab.com
- What was the customer experience during the incident? Timeouts or slow responses to API calls, affecting various web user experiences and automated jobs. CI jobs were also delayed/stalled as they use the APIs to fetch jobs and report results.
- How many customers were affected? Unknown (calculable if necessary, but it may not be worth the time); anyone using GitLab.com at the time was potentially affected.
- If a precise customer impact number is unknown, what is the estimated potential impact? 24 minutes of degraded responses for API calls
Incident Response Analysis
- How was the event detected? Apdex alerting
- How could detection time be improved? Difficult to impossible without making the alerting much more twitchy and noisy
- How did we reach the point where we knew how to mitigate the impact? By seeing the CPU load on the Fleet Overview dashboard, and then doing some ad-hoc log analysis on a single API node (sketched after this list)
- How could time to mitigation be improved? If ElasticSearch hadn't chosen that precise period to choke on an oddly mapped field, the log analysis could have been faster and the problematic IP addresses identified sooner. This is unusual, though; ElasticSearch normally works fine for this
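The ad-hoc log analysis mentioned above amounts to counting which client IPs dominated the suspect endpoint on one node. A minimal sketch over JSON-formatted access logs; the log path and field names (`path`, `remote_ip`) are assumptions, not the actual log schema:

```python
import json
from collections import Counter

hits = Counter()
with open("/var/log/gitlab/api_json.log") as log:  # assumed path
    for line in log:
        event = json.loads(line)
        if event.get("path") == "/api/v4/projects":
            hits[event.get("remote_ip")] += 1

# Abusive sources stand out far above normal clients.
for ip, count in hits.most_common(10):
    print(f"{count:8d}  {ip}")
```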
Post Incident Analysis
- How was the root cause diagnosed? Log analysis
- How could time to diagnosis be improved? As with mitigation above: the ElasticSearch problem slowed the log analysis, so the problematic IP addresses could have been identified sooner without it. This is unusual, though; ElasticSearch normally works fine for this
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? None that I know of, other than occasional talk of a generalized rate-limiting feature that takes into consideration account status (anonymous vs paid), IP addresses, and a number of other details.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change? Not one of our changes.
5 Whys
- Why did this happen? Because we have a public presence and it is easy for third parties to make requests at a rate that exceeds our ability to handle them.
- Why can we not just scale up? The cost of scaling to handle any possible inbound load would be prohibitive. Kubernetes autoscaling will help, but will still need some upper limits, and there may still be layers (e.g. the database) that cannot easily autoscale.
Lessons Learned
We really need highly functional rate-limiting that can do more than attempt (and fail) to limit by simplistic measures like requests per second per IP address. IP address is one dimension, but so are paid vs. unpaid status, plan level, number of projects, trust level (of the IP address, netblock, or other attributes), and multi-window rate limits (allow bursts, disallow long-term excessive usage). A sketch of the multi-window idea follows.
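A toy sketch of the multi-window idea only; the window sizes, limits, and key shape are illustrative assumptions, not a proposed design:

```python
import time
from collections import defaultdict, deque

class MultiWindowLimiter:
    """A request must fit under every window's cap, so short bursts
    pass while sustained excessive usage is rejected."""

    def __init__(self, windows=((1, 50), (60, 600), (3600, 5000))):
        self.windows = windows             # (seconds, max_requests) pairs
        self.history = defaultdict(deque)  # key -> request timestamps

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.history[key]
        horizon = now - max(w for w, _ in self.windows)
        while q and q[0] < horizon:        # drop entries older than the longest window
            q.popleft()
        for window, limit in self.windows:
            if sum(1 for t in q if t > now - window) >= limit:
                return False
        q.append(now)
        return True

limiter = MultiWindowLimiter()
# The key can combine the dimensions above: IP, plan, trust level, etc.
if not limiter.allow(("203.0.113.10", "anonymous", "free")):
    pass  # respond 429 Too Many Requests
```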
Corrective Actions
scalability#127 (closed) describes a possible solution (or the start of one), and may be something to hang some work from.