2020-10-24: Canary web performance very slow and giving increased 500 rate
Summary
2020-10-24: Canary web performance very slow and giving increased 500 rate
Timeline
All times UTC.
2020-10-24
- 18:33 - DDoS begins
- 18:45 - First PagerDuty Alert
- 18:50 - cmcfarland declares incident in Slack using the `/incident declare` command
- 19:26 - cmcfarland disables canary fleet
- 19:40 - cmcfarland re-enables canary fleet
- 20:22 - cmcfarland-admin set the `gitlab-org/gitlab` project to `Internal`
- 20:42 - Canary Recovery
- 21:22 - cmcfarland created a firewall rule to add a JS Challenge
- 21:28 - cmcfarland-admin set the `gitlab-org/gitlab` project to `Public`
Incident Review
Summary
A large volume of incoming traffic to the `gitlab-org/gitlab` project was causing the Canary web fleet to be saturated and very slow to respond. The offending traffic was primarily directed at spam issues created in that project.
- Service(s) affected: Web services on Canary
- Team attribution:
- Minutes downtime or degradation: 130 minutes (2 hours and 10 minutes)
Metrics
Customer Impact
- Who was impacted by this incident? All web customers using the Canary fleet or using projects whose traffic we specifically route through Canary
- What was the customer experience during the incident? Web timeouts and "unable to connect" errors
- How many customers were affected? This is difficult to know since we don't track unique users, but if we were to assume that each IP is, generally, a unique user, we could estimate that around 160,000 users may have been impacted. This is based on this visualization and assuming traffic on the 25th is roughly similar to normal traffic on the 24th.
- If a precise customer impact number is unknown, what is the estimated potential impact? See previous question.
Incident Response Analysis
- How was the event detected? PagerDuty alert that the gitlab-foss project issue list was not rendering.
- How could detection time be improved? The first alert came 12 minutes after the DDoS started, and it took about five minutes for the fleet to really fall over, so the effective detection time of around 5 to 7 minutes is not bad.
- How did we reach the point where we knew how to mitigate the impact? This took a little trial and error. Our past pattern of thinking has been that canary-isolated events like this might be a canary code or infrastructure issue. Shutting down the canary fleet showed that the problem followed the traffic and was not canary-specific, so the next step was to investigate the traffic to canary and to the specific projects routed through it. That investigation found the traffic surge.
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed? Turning the canary fleet down and back up, combined with examining the traffic logs, pointed to the incoming volume of traffic as the root cause (a log-analysis sketch follows this list).
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? Internal Application Rate Limiting could have reduced the impact of this type of event.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change? No.
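To illustrate the kind of log examination described above, a small script along these lines can surface a traffic surge by counting requests per client IP and per path. This is a sketch only, not the tooling actually used during the incident; the JSON log format and the `remote_ip` and `path` field names are assumptions.

```python
# Illustrative sketch only: count requests per client IP and per path from a
# JSON-formatted access log to spot a traffic surge. Field names ("remote_ip",
# "path") are assumptions and will differ depending on the log source.
import json
import sys
from collections import Counter

ip_counts: Counter = Counter()
path_counts: Counter = Counter()

with open(sys.argv[1]) as log:
    for line in log:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        ip_counts[entry.get("remote_ip", "unknown")] += 1
        path_counts[entry.get("path", "unknown")] += 1

print("Top client IPs:")
for ip, count in ip_counts.most_common(10):
    print(f"  {count:>8}  {ip}")

print("Top requested paths:")
for path, count in path_counts.most_common(10):
    print(f"  {count:>8}  {path}")
```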
5 Whys
Lessons Learned
- Reducing the visibility of the project under attack is a high-cost action that causes a lot of disruption for merge requests submitted from forks of that project. We should avoid using this as a tactic in the future. A better option is Cloudflare rules to help mitigate the traffic.
- Rate limiting unauthenticated users could mitigate this kind of DDoS. In general, leveraging Cloudflare better could help prevent these incidents (see the sketch after this list).
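For context on the 21:22 mitigation and the Cloudflare point above, the sketch below shows one way a JS Challenge rule could be created through Cloudflare's firewall rules API. It is illustrative only: the zone ID and API token are placeholders, and the filter expression is an assumption, not the rule that was actually deployed during the incident.

```python
# Illustrative sketch only: create a Cloudflare firewall rule that presents a
# JS Challenge to unauthenticated requests hitting the targeted project.
# CF_ZONE_ID and CF_API_TOKEN are placeholders; the filter expression is an
# assumption, not the rule used during this incident.
import os
import requests

CF_ZONE_ID = os.environ["CF_ZONE_ID"]
CF_API_TOKEN = os.environ["CF_API_TOKEN"]

rule = {
    "action": "js_challenge",
    "description": "JS Challenge for anonymous traffic to gitlab-org/gitlab issues",
    "filter": {
        # Challenge requests to the project's issue pages that carry no session cookie.
        "expression": '(http.request.uri.path contains "/gitlab-org/gitlab/-/issues") '
                      'and not (http.cookie contains "_gitlab_session")'
    },
}

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{CF_ZONE_ID}/firewall/rules",
    headers={"Authorization": f"Bearer {CF_API_TOKEN}"},
    json=[rule],  # this endpoint accepts a list of rules
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```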
Corrective Actions
- Update runbooks to provide better documentation on DoS mitigation options during an incident
- Identify a method to programmatically identify and rate limit anonymous vs. authenticated user traffic (a sketch of one approach follows below)
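A minimal sketch of the second corrective action, assuming a fixed-window limiter keyed by client IP for anonymous requests and by user ID for authenticated ones. The thresholds and key scheme are illustrative assumptions, not GitLab's actual limits or implementation.

```python
# Illustrative sketch only: apply a tighter fixed-window rate limit to
# anonymous requests (keyed by client IP) than to authenticated requests
# (keyed by user ID). Thresholds and the key scheme are assumptions.
import time
from collections import defaultdict
from typing import DefaultDict, Optional, Tuple

WINDOW_SECONDS = 60
ANONYMOUS_LIMIT = 100      # requests per window per client IP
AUTHENTICATED_LIMIT = 600  # requests per window per user

_counters: DefaultDict[Tuple[str, int], int] = defaultdict(int)


def allow_request(client_ip: str, user_id: Optional[str] = None) -> bool:
    """Return True if this request is within its rate budget."""
    window = int(time.time()) // WINDOW_SECONDS
    if user_id is None:
        key, limit = (f"ip:{client_ip}", window), ANONYMOUS_LIMIT
    else:
        key, limit = (f"user:{user_id}", window), AUTHENTICATED_LIMIT
    _counters[key] += 1
    return _counters[key] <= limit
```

Giving anonymous traffic a much smaller per-IP budget means a flood like this one exhausts the anonymous budget quickly without starving signed-in users.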
Guidelines