2020-06-04 Large load spike on API fleet causing response degradation
/label ~incident ~Incident::Active
Summary
Large load spike on API fleet causing response degradation
Root cause was not known at the time of declaration; see the Incident Review below for the full analysis.
Timeline
All times UTC.
2020-06-04
- 03:14 - First alert received: API service apdex score (latency) below SLO (apdex is sketched in the note after this timeline). Further alerts were received later, but all had the same cause.
- 03:17 - cmiskell declares an incident in Slack using the `/incident declare` command.
- 03:30 - Three problem IP addresses from AWS EC2 address ranges were identified as the source of the traffic, hammering `/api/v4/projects` at a high rate.
- 03:31 - IP addresses blocked
- 03:34 - API service apdex returns to normal and alert clears. Other alerts clear over the next 5 minutes.
- 03:39 - All alerts cleared.
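For context on the 03:14 alert: apdex scores request latency against a threshold T, counting requests completing within T as satisfied and those within 4T as tolerating. A minimal sketch of the standard calculation, with an illustrative threshold rather than GitLab's actual SLO value:

```python
def apdex(latencies_s, t=0.5):
    """Standard Apdex: (satisfied + tolerating / 2) / total.
    t=0.5s is illustrative, not GitLab's actual SLO threshold."""
    satisfied = sum(1 for l in latencies_s if l <= t)
    tolerating = sum(1 for l in latencies_s if t < l <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies_s)

# A CPU-saturated fleet pushes many requests past 4T, so the score
# drops toward 0 and the "apdex below SLO" alert fires.
print(apdex([0.1, 0.3, 0.9, 2.5]))  # 0.625
```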
Incident Review
Summary
A small number of IP addresses from AWS, possibly belonging to a legitimate actor but misbehaving for some reason, began hitting a single API endpoint (`/api/v4/projects`) with no parameters at a rate of 700-800 requests per second. Each request returned JSON describing 20 (most recently created?) public projects. This pushed CPU usage on the API servers to close to 100%, degrading all other calls to the API servers.
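For reference, the offending call is reproducible unauthenticated; GitLab's projects API returns 20 results per page by default, ordered by creation date descending, which would explain the 20 "most recently created?" public projects in the responses. A minimal sketch:

```python
import requests

# Unauthenticated request to the endpoint that was being hammered.
# With no parameters the projects API defaults to 20 results per page,
# most recently created first.
resp = requests.get("https://gitlab.com/api/v4/projects", timeout=10)
resp.raise_for_status()
projects = resp.json()
print(len(projects))                       # 20 by default
print(projects[0]["path_with_namespace"])  # newest public project
```

Serving this listing involves visibility checks and ordering work server-side, which is consistent with 700-800 requests per second saturating API CPU.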
It is believed this was erroneous activity from a third party that has otherwise been specifically allowed to bypass CloudFlare filtering via a list of known IP addresses: the 'bypass' activity on their rule went up at the same time as the problem traffic, and when we blocked the 3 IP addresses, the rate of activity on the bypass rule rose further still. However, it appears the 'good' activity was sufficiently moderated and sensible that it did not cause any load issues. I therefore believe the 3 bad IPs were misbehaving nodes; when they started getting immediate errors, I suspect the job dispatch system at the other end routed more jobs to the healthy nodes, where they succeeded. This is speculation based on very few data points.
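The report does not record the exact blocking mechanism. Given that this traffic was already being evaluated by CloudFlare rules, one plausible way to block the three addresses at the edge is an IP Access Rule via CloudFlare's v4 API; the token, zone ID, and addresses below are placeholders, not values from this incident:

```python
import requests

CF_TOKEN = "REDACTED"  # placeholder, not a real credential
ZONE_ID = "REDACTED"   # placeholder zone
BAD_IPS = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]  # documentation addresses

for ip in BAD_IPS:
    resp = requests.post(
        f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}"
        "/firewall/access_rules/rules",
        headers={"Authorization": f"Bearer {CF_TOKEN}"},
        json={
            "mode": "block",
            "configuration": {"target": "ip", "value": ip},
            "notes": "2020-06-04 incident: abusive /api/v4/projects traffic",
        },
        timeout=10,
    )
    resp.raise_for_status()
```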
- Service(s) affected: Service::API, Service::CI Runners
- Team attribution: team::Reliability
- Minutes downtime or degradation: 24 minutes
Metrics
Customer Impact
- Who was impacted by this incident? All users of gitlab.com
- What was the customer experience during the incident? Timeouts or slow responses to API calls, affecting various web user experiences and automated jobs. CI jobs were also delayed/stalled as they use the APIs to fetch jobs and report results.
- How many customers were affected? Unknown (calculable if necessary, but it may not be worth the time); anyone using GitLab.com at the time was potentially affected.
- If a precise customer impact number is unknown, what is the estimated potential impact? 24 minutes of degraded responses for API calls
Incident Response Analysis
- How was the event detected? Apdex alerting
- How could detection time be improved? Difficult to impossible without making the alerting much more twitchy and noisy
- How did we reach the point where we knew how to mitigate the impact? By seeing the CPU load on the Fleet Overview dashboard, and then doing some ad-hoc log analysis on a single API node (sketched after this list)
- How could time to mitigation be improved? If ElasticSearch hadn't chosen that precise period to choke on an oddly mapped field, the log analysis could have been faster and the problematic IP addresses identified sooner. This is unusual, though; ElasticSearch normally works fine for this
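The ad-hoc log analysis mentioned above amounts to counting which client IPs dominated the suspect endpoint on one node. A minimal sketch over JSON-formatted access logs; the log path and field names (`path`, `remote_ip`) are assumptions, not the actual log schema:

```python
import json
from collections import Counter

hits = Counter()
with open("/var/log/gitlab/api_json.log") as log:  # assumed path
    for line in log:
        event = json.loads(line)
        if event.get("path") == "/api/v4/projects":
            hits[event.get("remote_ip")] += 1

# Abusive sources stand out far above normal clients.
for ip, count in hits.most_common(10):
    print(f"{count:8d}  {ip}")
```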
Post Incident Analysis
- How was the root cause diagnosed? Log analysis
- How could time to diagnosis be improved? As with mitigation above: the ElasticSearch problem slowed the log analysis, so the problematic IP addresses could have been identified sooner without it. This is unusual, though; ElasticSearch normally works fine for this
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? None that I know of, other than occasional talk of a generalized rate-limiting feature that takes into consideration account status (anonymous vs paid), IP addresses, and a number of other details.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change? Not one of our changes.
5 Whys
- Why did this happen? Because we have a public presence and it is easy for third parties to make requests at a rate that exceeds our ability to handle them.
- Why can we not just scale up? The cost of scaling to handle any possible inbound load would be prohibitive. Kubernetes autoscaling will help, but will still need some upper limits, and there may still be layers (e.g. the database) that cannot easily autoscale.
Lessons Learned
We really need highly functional rate-limiting that can do more than attempt (and fail) to limit by simplistic measures like requests per second per IP address. IP address is one dimension, but so are paid vs. unpaid status, plan level, number of projects, trust level (of the IP address, netblock, or other attributes), and multi-window rate limits (allow bursts, disallow long-term excessive usage). A sketch of the multi-window idea follows.
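A toy sketch of the multi-window idea only; the window sizes, limits, and key shape are illustrative assumptions, not a proposed design:

```python
import time
from collections import defaultdict, deque

class MultiWindowLimiter:
    """A request must fit under every window's cap, so short bursts
    pass while sustained excessive usage is rejected."""

    def __init__(self, windows=((1, 50), (60, 600), (3600, 5000))):
        self.windows = windows             # (seconds, max_requests) pairs
        self.history = defaultdict(deque)  # key -> request timestamps

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.history[key]
        horizon = now - max(w for w, _ in self.windows)
        while q and q[0] < horizon:        # drop entries older than the longest window
            q.popleft()
        for window, limit in self.windows:
            if sum(1 for t in q if t > now - window) >= limit:
                return False
        q.append(now)
        return True

limiter = MultiWindowLimiter()
# The key can combine the dimensions above: IP, plan, trust level, etc.
if not limiter.allow(("203.0.113.10", "anonymous", "free")):
    pass  # respond 429 Too Many Requests
```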
Corrective Actions
scalability#127 (closed) describes a possible solution (or the start of one), and may be something to hang some work from.