2020-10-24: Canary web performance very slow and giving increased 500 rate
## Summary

### 2020-10-24: Canary web performance very slow and giving increased 500 rate

## Timeline

All times UTC.

2020-10-24

- 18:33 - DDoS begins
- 18:45 - [First PagerDuty Alert](https://gitlab.pagerduty.com/incidents/P2TY6S8)
- 18:50 - cmcfarland declares incident in Slack using `/incident declare` command.
- 19:26 - cmcfarland disables canary fleet
- 19:40 - cmcfarland re-enables canary fleet
- 20:22 - cmcfarland-admin set the `gitlab-org/gitlab` project to `Internal`
- 20:42 - Canary recovery
- 21:22 - cmcfarland created a firewall rule to add a JS Challenge
- 21:28 - cmcfarland-admin set the `gitlab-org/gitlab` project to `Public`

## Incident Review

## Summary

A large volume of incoming traffic to the `gitlab-org/gitlab` project was saturating the Canary web fleet, making it very slow to respond. The offending traffic was primarily directed at spam issues created in that project.

1. Service(s) affected: **Web services on Canary**
1. Team attribution:
1. Minutes downtime or degradation: **2 hours and 10 minutes**

## Metrics

## Customer Impact

1. Who was impacted by this incident? **All web customers using the Canary fleet or using projects we specifically route through Canary.**
2. What was the customer experience during the incident? **Web timeouts and unable-to-connect errors.**
3. How many customers were affected? **This is difficult to know since we don't track unique users, but if we assume that each IP is, generally, a unique user, we can estimate that around 160,000 users may have been impacted. This is based on [this visualization](https://log.gprd.gitlab.net/goto/98b8c9ee812f90b374fa306979a21b17) and the assumption that traffic on the 25th is roughly similar to normal traffic on the 24th.**
4. If a precise customer impact number is unknown, what is the estimated potential impact? **See previous question.**

## Incident Response Analysis

1. How was the event detected? **A [PagerDuty alert](https://gitlab.pagerduty.com/incidents/P2TY6S8) that the GitLab FOSS project issue list was not rendering.**
2. How could detection time be improved? **The first alert fired 12 minutes after the DDoS began, and the service took about five minutes to fully fall over, so the effective detection time of roughly 5 to 7 minutes is not bad.**
3. How did we reach the point where we knew how to mitigate the impact? **This took a little trial and error. Our past pattern of thinking has been that Canary-isolated events like this might be a Canary code or infrastructure issue. Shutting down Canary showed us that the problem followed the traffic and was not Canary-specific.
Therefore, the next steps were to investigate the traffic to Canary and/or to specific Canary-routed projects. This examination did find the traffic surge.**
4. How could time to mitigation be improved?

## Post Incident Analysis

1. How was the root cause diagnosed? **The results of the Canary fleet being turned down and brought back up, combined with log examination of the traffic, led to a root cause of a large volume of incoming traffic.**
2. How could time to diagnosis be improved?
3. Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? **[Internal Application Rate Limiting](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/341) could have reduced the impact of this type of event.**
4. Was this incident triggered by a change (deployment of code or change to infrastructure; _if yes, have you linked the issue which represents the change?_)? **No.**

## 5 Whys

## Lessons Learned

1. Reducing the visibility of the project under attack is a high-cost action that causes significant disruption to merge requests submitted from forks of that project. We should avoid using this as a tactic in the future; a better option is Cloudflare rules to help mitigate the traffic.
1. Rate limiting unauthenticated users could mitigate this kind of DDoS. In general, leveraging Cloudflare better could help prevent these incidents.

## Corrective Actions

- [Update runbooks to provide better documentation on DoS mitigation options during an incident](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11771)
- [Identify a method to programmatically identify and rate limit anonymous vs authenticated user traffic](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11773)

## Guidelines

- [Blameless RCA Guideline](https://about.gitlab.com/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis.html#meeting-purpose)
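The second corrective action above calls for distinguishing anonymous from authenticated traffic and rate limiting each tier differently. As an illustration only (this is not GitLab's actual implementation; the class name, limits, and keys below are hypothetical), a minimal fixed-window limiter keyed on client IP for anonymous requests and user ID for authenticated ones could be sketched as:

```ruby
# Hypothetical sketch: tiered fixed-window rate limiting that gives anonymous
# traffic a stricter budget than authenticated users. Limits are illustrative.
class RateLimiter
  WINDOW_SECONDS = 60

  def initialize(anonymous_limit: 100, authenticated_limit: 600)
    @limits = { anonymous: anonymous_limit, authenticated: authenticated_limit }
    # One counter per (tier, key) pair, reset when its window expires.
    @counters = Hash.new { |h, k| h[k] = { window_start: nil, count: 0 } }
  end

  # key: client IP for anonymous requests, user id for authenticated ones.
  # Returns true if the request is allowed, false if it should receive a 429.
  def allow?(key, kind, now: Time.now)
    bucket = @counters[[kind, key]]
    if bucket[:window_start].nil? || now - bucket[:window_start] >= WINDOW_SECONDS
      bucket[:window_start] = now
      bucket[:count] = 0
    end
    bucket[:count] += 1
    bucket[:count] <= @limits.fetch(kind)
  end
end
```

In production, logic like this would typically live in middleware (GitLab uses Rack::Attack for application-level throttling) backed by a shared store such as Redis rather than in-process memory, so that limits hold across the whole web fleet.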