2020-10-24: Canary web performance very slow and giving increased 500 rate
## Summary

### 2020-10-24: Canary web performance very slow and giving increased 500 rate

## Timeline

All times UTC.

2020-10-24

- 18:33 - DDoS begins
- 18:45 - [First PagerDuty Alert](https://gitlab.pagerduty.com/incidents/P2TY6S8)
- 18:50 - cmcfarland declares incident in Slack using `/incident declare` command.
- 19:26 - cmcfarland disables canary fleet
- 19:40 - cmcfarland re-enables canary fleet
- 20:22 - cmcfarland-admin set the `gitlab-org/gitlab` project to `Internal`
- 20:42 - Canary recovery
- 21:22 - cmcfarland created a firewall rule to add a JS Challenge
- 21:28 - cmcfarland-admin set the `gitlab-org/gitlab` project to `Public`

## Incident Review

## Summary

A large volume of incoming traffic to the `gitlab-org/gitlab` project was saturating the Canary web fleet, making it very slow to respond. The offending traffic was primarily directed at spam issues created in that project.

1. Service(s) affected: **Web services on Canary**
1. Team attribution:
1. Minutes downtime or degradation: **2 hours and 10 minutes**

## Metrics

## Customer Impact

1. Who was impacted by this incident? **All web customers using the Canary fleet or using projects we specifically route through Canary.**
2. What was the customer experience during the incident? **Web timeouts and unable-to-connect errors.**
3. How many customers were affected? **This is difficult to know since we don't track unique users, but if we assume that each IP is, generally, a unique user, we can estimate that around 160,000 users may have been impacted. This is based on [this visualization](https://log.gprd.gitlab.net/goto/98b8c9ee812f90b374fa306979a21b17) and the assumption that traffic on the 25th is roughly similar to normal traffic on the 24th.**
4. If a precise customer impact number is unknown, what is the estimated potential impact? **See previous question.**

## Incident Response Analysis

1. How was the event detected? **A [PagerDuty alert](https://gitlab.pagerduty.com/incidents/P2TY6S8) that the GitLab FOSS project issue list was not rendering.**
2. How could detection time be improved? **The first alert fired 12 minutes after the DDoS began, and the service took about five minutes to fully fall over, so the effective detection time of roughly 5 to 7 minutes is not bad.**
3. How did we reach the point where we knew how to mitigate the impact? **This took a little trial and error. Our past pattern of thinking has been that Canary-isolated events like this might be a Canary code or infrastructure issue. Shutting down Canary showed us that the problem followed the traffic and was not Canary-specific.
Therefore, the next steps were to investigate the traffic to Canary and/or to specific Canary-routed projects. This examination did find the traffic surge.**
4. How could time to mitigation be improved?

## Post Incident Analysis

1. How was the root cause diagnosed? **The results of the Canary fleet being turned down and brought back up, combined with log examination of the traffic, led to a root cause of a large volume of incoming traffic.**
2. How could time to diagnosis be improved?
3. Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? **[Internal Application Rate Limiting](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/341) could have reduced the impact of this type of event.**
4. Was this incident triggered by a change (deployment of code or change to infrastructure; _if yes, have you linked the issue which represents the change?_)? **No.**

## 5 Whys

## Lessons Learned

1. Reducing the visibility of the project under attack is a high-cost action that causes significant disruption to merge requests submitted from forks of that project. We should avoid using this as a tactic in the future; a better option is Cloudflare rules to help mitigate the traffic.
1. Rate limiting unauthenticated users could mitigate this kind of DDoS. In general, leveraging Cloudflare better could help prevent these incidents.

## Corrective Actions

- [Update runbooks to provide better documentation on DoS mitigation options during an incident](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11771)
- [Identify a method to programmatically identify and rate limit anonymous vs authenticated user traffic](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11773)

## Guidelines

- [Blameless RCA Guideline](https://about.gitlab.com/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis.html#meeting-purpose)
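The second corrective action above calls for distinguishing anonymous from authenticated traffic and rate limiting each tier differently. As an illustration only (this is not GitLab's actual implementation; the class name, limits, and keys below are hypothetical), a minimal fixed-window limiter keyed on client IP for anonymous requests and user ID for authenticated ones could be sketched as:

```ruby
# Hypothetical sketch: tiered fixed-window rate limiting that gives anonymous
# traffic a stricter budget than authenticated users. Limits are illustrative.
class RateLimiter
  WINDOW_SECONDS = 60

  def initialize(anonymous_limit: 100, authenticated_limit: 600)
    @limits = { anonymous: anonymous_limit, authenticated: authenticated_limit }
    # One counter per (tier, key) pair, reset when its window expires.
    @counters = Hash.new { |h, k| h[k] = { window_start: nil, count: 0 } }
  end

  # key: client IP for anonymous requests, user id for authenticated ones.
  # Returns true if the request is allowed, false if it should receive a 429.
  def allow?(key, kind, now: Time.now)
    bucket = @counters[[kind, key]]
    if bucket[:window_start].nil? || now - bucket[:window_start] >= WINDOW_SECONDS
      bucket[:window_start] = now
      bucket[:count] = 0
    end
    bucket[:count] += 1
    bucket[:count] <= @limits.fetch(kind)
  end
end
```

In production, logic like this would typically live in middleware (GitLab uses Rack::Attack for application-level throttling) backed by a shared store such as Redis rather than in-process memory, so that limits hold across the whole web fleet.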