2019-11-28 Incident Review: GitLab.com Down due to ip-tables issue
Incident: production#1421 (closed)
Summary
A brief summary of what happened. Try to make it as executive-friendly as possible.
- Service(s) affected : ~"Service:API" ~"Service:CI Runners" ~"Service:Web"
- Team attribution : @gitlab-com/gl-infra
- Minutes downtime or degradation : 98 Min. (11:02 - 12:40 UTC)
For calculating the duration of the event, use the Platform Metrics Dashboard to look at Apdex and SLO violations.
Impact & Metrics
Start with the following:
- What was the impact of the incident? All services for GitLab.com were down. No web or API requests could be processed.
- Who was impacted by this incident?
- How many customers were affected?
- How many attempts were made to access the impacted service/feature?
- How many customers tried to access the impacted service/feature?
All users of GitLab.com. From https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&from=1574935200000&to=1574946000000 we can extrapolate:
Service | ~Workhorse req/sec | Minutes of disruption | ~Requests missed | 1d req rate Nov 28 (err rate) | 1d req rate Nov 21 (err rate) |
---|---|---|---|---|---|
Main web | 2900 | 98 | 17.05M | 187,076,974 (9.1%) | 194,680,985 (8.8%) |
Main git | 1500 | 98 | 8.82M | 116,873,222 (7.5%) | 120,696,354 (7.3%) |
Main API | 2600 | 98 | 15.29M | 165,365,470 (9.2%) | 209,633,109 (7.3%) |
Requests missed ≈ req/sec × 98 minutes × 60. Daily request rates are from Thanos queries for approximate daily requests (one possible form of these queries is sketched below).
- How did the incident impact customers? Users of GitLab.com were unable to push to or pull from repositories and could not interact with the web application.
Include any additional metrics that are of relevance.
Provide any relevant graphs that could help understand the impact of the incident and its dynamics.
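The daily request counts above came from Thanos; the following is only a hedged sketch of the kind of query and the missed-request arithmetic. The endpoint, metric name, and label filters are assumptions, not the exact queries that were used.

```sh
# Hypothetical sketch only: approximate one day's web requests via the Thanos
# query API. Endpoint, metric name, and label filters are assumptions.
THANOS="https://thanos-query.example.gitlab.net"   # placeholder endpoint
curl -sG "${THANOS}/api/v1/query" \
  --data-urlencode 'query=sum(increase(gitlab_workhorse_http_requests_total{type="web"}[1d]))' \
  --data-urlencode 'time=2019-11-28T23:59:59Z'

# "~Requests missed" in the table is straight arithmetic on the per-second rate:
echo $((2900 * 98 * 60))   # ~17.05M missed web requests over the 98-minute outage
```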
Detection & Response
Start with the following:
- How was the incident detected? Primarily by PagerDuty alerts. The first alerts came in at 10:48 UTC, with the main down alerts at 11:02 UTC. There was a post in #production at 10:40 noting a 500; it may have been foreshadowing.
- Did alarming work as expected? Mostly, though the 20 minutes between the Slack post and the first alarms is worth noting.
- How long did it take from the start of the incident to its detection? 20 minutes.
- How long did it take from detection to remediation? 98 minutes total.
- Were there any issues with the response to the incident? (e.g. bastion host used to access the service was not available, relevant team member wasn't pageable, ...)
- Determining the root cause took longer than it should have. We should have known sooner about the change that added the iptables cookbook to the run lists of the DB nodes.
- We did not revert the iptables change on all DB nodes at the same time, due to an assumption that only 10 DB nodes were taking traffic. The revert and flush of iptables rules was done on DB nodes 1-10, but not on nodes 11 and 12.
Root Cause Analysis
The purpose of this document is to understand the reasons that caused an incident, and to create mechanisms to prevent it from recurring in the future. A root cause can never be a person; the write-up has to refer to the system and the context rather than to specific actors.
Follow the "5 whys" in a blameless manner as the core of the root-cause analysis.
For this it is necessary to start with the incident and question why it happened. Keep iterating, asking "why?" 5 times. While it's not a hard rule that it has to be exactly 5 times, it helps to keep the questions digging deeper toward the actual root cause.
Keep in mind that one "why?" may have more than one answer; consider following the different branches.
Example of the usage of "5 whys"
The web application for GitLab.com started throwing 5xx errors for a large number of requests. A rollout of the iptables cookbook to a subset of hosts mistakenly included the DB hosts in production. The DB hosts started enforcing an iptables rule to drop packets; unfortunately, this meant all traffic from the web and API hosts was being dropped.
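As a minimal illustration of the failure mode (these rules and ports are assumptions, not the actual gitlab-iptables cookbook output): a default-deny INPUT policy with no allowance for database traffic silently drops new connections from the web/API fleet, and because DROP sends no error back the way REJECT would, clients hang rather than fail fast.

```sh
# Illustrative sketch only; not the cookbook's actual rules.
iptables -P INPUT DROP                              # silently drop anything not explicitly allowed
iptables -A INPUT -i lo -j ACCEPT                   # loopback
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT       # SSH stays reachable
# No ACCEPT rules for PostgreSQL / PgBouncer (e.g. tcp/5432, tcp/6432), so new
# connections from web/api nodes stall until they time out instead of erroring.

# Preview variant (see the iptables LOG corrective action): keep the policy at
# ACCEPT and end the chain with a LOG rule, so the kernel logs exactly what a
# DROP policy would have killed, without dropping anything.
iptables -P INPUT ACCEPT
iptables -A INPUT -j LOG --log-prefix "iptables-would-drop: "
```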
Meta - linking to comments below for these answers
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8528#note_251976339
- Why was the rollout performed?
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8528#note_252224522
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8528#note_252148610
- Why was the affected set of nodes for the rollout unclear?
- Why was this not tested or observed in staging?
What went well
Start with the following:
- Identify the things that worked well or as expected.
- Any additional call-outs for what went particularly well.
What can be improved
Start with the following:
- Using the root cause analysis, explain what can be improved to prevent this from happening again.
- Is there anything that could have been done to improve the detection or time to detection?
- Is there anything that could have been done to improve the response or time to response?
- Is there an existing issue that would have either prevented this incident or reduced the impact?
- Did we have any indication or beforehand knowledge that this incident might take place?
Corrective actions
- For every role file we change, we should list what hosts will be affected and add it to the MR for review. This should be as easy as a pipeline step that does a knife search -i and posts the output (a sketch follows this list).
- Jarv: must have - https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8628
- We need an SSOT for blackout windows / change locks that can be referenced by both deploy and configuration pipelines. delivery#583 (closed)
- Jarv: must have
- Matt: Can we also have blackout windows as an annotation on Grafana dashboards (e.g. the GitLab Triage dashboard)?
- Jarv: great idea, I would add it to the issue; I think we can do that (corrective action 7)
- Matt: Will do, thanks!
- Change locks, when they are in place, should be enforced automatically in pipelines, as they are now for deployments.
- Have a success/failure metric for gitlab-qa runs and set up an alert for failures. Nice to have
- Perform an audit to see which roles are currently shared by production and staging; if there are shared role files, we should split them. Nice to have
- Cameron: Doesn’t this subvert the intended use of environments in Chef? And could this introduce mistakes when moving changes from one role to another?
- Investigate waiting for Chef convergence to complete on staging before allowing an update to a production role. Very hard to do with Chef. Nice to have
- Gerir: the separation of staging vs. prod should really be a must have -> adding to DNA next week. Dependent on item 5 above.
- actions 5 and 6 together: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8632
- Investigate adding Grafana annotations for Chef role updates.
- Ensure that the chef-repo pipeline can be run for all production changes; this will mean doing a more comprehensive compare-and-apply against the Chef server. Nice to have / overlaps with item 1
- ~~Ensure there are no dependencies on registry.gitlab.com in the chef-repo .gitlab-ci.yml or terraform .gitlab-ci.yml~~ - dupe
- Create an alerting rule for detecting a significant drop of traffic going into and out of PgBouncer (see the sketch after this list). Must have. Related to the discussion on alerting on service dependencies; we need a dashboard like Nagios service dependencies as a tool to help us discuss.
- Move ops.gitlab.net to use a non-gitlab.com registry; pipelines could not run because they were trying to pull from the production registry. #8569 (closed)
- Jarv: this is done
- Set up a 400/500 page in Cloudflare to display a response when HAProxy is down.
- Hendrik: Created an issue for this: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8598
- Must do when we are on cloudflare
- Skarbek: should we consider doing something similar to Fastly until we are behind cloudflare?
- Hendrik: How would Fastly factor into gitlab.com traffic, as it’s not in-line?
- Skarbek: misunderstanding of how fastly is configured on my part, my apologies
- Hendrik: No worries :)
- We need to investigate why losing just 2 Patroni nodes affected us.
- Matt: My guess is: the Rails db connection pooler would have kept those two nodes in active use, because (1) Consul’s health checks were not failing for those nodes, so Rails was not notified that the list of replica dbs should change, and (2) because incoming packets were being dropped instead of rejected, each connection attempt was slow to fail instead of quick to fail. That would cause Rails requests that happened to choose the silently unreachable replicas to be very slow compared to other requests. That long-tail response time would have quickly come to dominate the pool, since 2 out of every 11 random requests were doomed to stall for many seconds. Most requests in flight would be stalled requests. The above chain of cascading failure should be reproducible, so we can test this hypothesis.
- Anthony to take action to make a meeting to discuss this
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8634
- Use the iptables LOG target to preview what would be dropped (a preview variant is sketched in the Root Cause Analysis section above). Details.
- Define “What is production?” clearly and explicitly: a long list of what is production and what is not. And be more explicit about what constitutes a change.
- Determine whether or not Chef commits require a change issue.
- Cameron: It’s my understanding that they do if they affect production.
- Deprecate iptables in favor of GCP firewall rules (see the sketch after this list).
- Must do.
- Matt: We may still need a recipe to disable host-level firewall helpers like “ufw” or “firewalld”. Also, how should we handle VMs on other cloud providers than GCP (e.g. AWS, Azure)? The existing “gitlab-iptables::default” recipe looks like it enables iptables on AWS; not sure if that’s still relevant.
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8638
- Audit where there are (or will be) unmanaged iptables rules on hosts that have used or currently use the gitlab-iptables Chef cookbook.
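For corrective action 1 (posting affected hosts on the MR), one possible shape of the pipeline step; the role name and the way changed roles are derived from the MR are placeholders, not the actual chef-repo implementation.

```sh
# Hypothetical pipeline step: show the blast radius of a role change on the MR.
# CHANGED_ROLES would normally be derived from the MR diff; hard-coded here.
CHANGED_ROLES="gprd-base-db-patroni"   # placeholder role name

for role in ${CHANGED_ROLES}; do
  echo "Hosts affected by role ${role}:"
  knife search node "roles:${role}" -i   # -i prints only the matching node names
done
```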
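For the PgBouncer traffic-drop corrective action, one possible shape of the alerting rule, validated with promtool; the metric name and threshold are assumptions and would need to match whatever the pgbouncer exporter actually exposes.

```sh
# Hypothetical alert: fire when PgBouncer query throughput falls to less than
# half of what it was an hour ago. Metric name and thresholds are placeholders.
cat > pgbouncer_traffic_drop.yml <<'EOF'
groups:
  - name: pgbouncer-traffic
    rules:
      - alert: PGBouncerTrafficDrop
        expr: |
          sum(rate(pgbouncer_stats_queries_pooled_total[5m]))
            < 0.5 * sum(rate(pgbouncer_stats_queries_pooled_total[5m] offset 1h))
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PgBouncer query traffic dropped by more than 50% vs. 1h ago"
EOF

promtool check rules pgbouncer_traffic_drop.yml   # validate before deploying
```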
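For the "deprecate iptables in favor of GCP firewall rules" action, the kind of host-level allow rule that iptables would otherwise need becomes a VPC firewall rule instead; the network, tags, and ports below are placeholders, not our actual configuration.

```sh
# Hypothetical VPC firewall rule allowing the web/api fleet to reach the
# Patroni/PgBouncer ports, replacing host-level iptables. All names are placeholders.
gcloud compute firewall-rules create allow-web-to-patroni \
  --network=gprd \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:5432,tcp:6432 \
  --source-tags=web,api \
  --target-tags=patroni
```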
Guidelines
Incident Review Meeting notes: https://docs.google.com/document/d/1E8FU_fKwkaMTmuCukZ2f-p1YVpWssCl7uGrSqp0vmaA/edit