2021-03-15 Cloud NAT in gprd is at its limit
Summary
The current deploy is stuck in ImagePullBackOff loops. It looks like all ports on the Cloud NAT for our GKE cluster are occupied, intermittently blocking image pulls.
Timeline
All times UTC.
2021-03-15
- 14:41 - Deployment to Kubernetes Infrastructure times out
- 14:49 - First attempted deploy to production failed in Kubernetes
- 19:24 - Mitigation via modification to deployments to stagger cluster deploys
Corrective Actions
- Modification to our deployments to stagger cluster jobs: #3981 (comment 530865965)
- We do not monitor port availability on our Cloud NAT devices. This led to port exhaustion unknowingly impacting deployments and likely other systems (a possible monitoring sketch follows this list) - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12541
- We do not fully understand the IP allocation and customization capabilities needed to grow any further - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12542
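As a starting point for the monitoring gap called out above, the sketch below polls Cloud Monitoring for Cloud NAT per-VM port usage and flags instances approaching their allocation. It is a minimal illustration only: the project ID, threshold, and assumed per-VM allocation are placeholders, not something we have wired up.

```python
# Minimal sketch: read Cloud NAT per-VM port usage from Cloud Monitoring
# and flag VMs approaching their port allocation. Placeholder project ID
# and threshold; not a production alerting rule.
import time
from google.cloud import monitoring_v3

PROJECT = "projects/my-gcp-project"  # placeholder project
THRESHOLD = 0.8 * 64                 # assumed 64-port minimum allocation per VM

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 300}}
)

series = client.list_time_series(
    request={
        "name": PROJECT,
        "filter": 'metric.type = "router.googleapis.com/nat/port_usage"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    vm = ts.resource.labels.get("instance_id", "unknown")
    latest = ts.points[0].value.int64_value if ts.points else 0
    if latest >= THRESHOLD:
        print(f"VM {vm} is using {latest} NAT ports (threshold {THRESHOLD:.0f})")
```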
Incident Review
Summary
We leverage Google's Cloud NAT to provide a preset, configured set of IP addresses through which external network traffic egresses, allowing customers to more confidently define which IP blocks are allowed to reach into their infrastructure endpoints. This enables us to configure our clusters in GCP as private, such that the nodes do not have public-facing IP addresses. Cloud NAT devices allow a limited number of connections to traverse them, in this case 65,536 ports per IP address. At the time of the incident we were using 14 IP addresses, giving us a maximum capacity of 917,504 available connections for ANY type of traffic between our GKE infrastructure and the outside world. Because we cannot expand our IP address range without notifying customers ahead of time, to ensure that traffic to heavily firewalled customers does not break, we've elected to see if we can stagger deployments.
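As a back-of-the-envelope illustration of the figures above (the per-node port allocation is an assumed value for illustration, not our actual configuration):

```python
# Rough Cloud NAT capacity math for the numbers quoted above.
PORTS_PER_IP = 65_536   # ports available per NAT IP address
NAT_IPS = 14            # NAT IPs in use at the time of the incident

total_ports = PORTS_PER_IP * NAT_IPS
print(f"total NAT ports: {total_ports}")  # 917504

# If each GKE node were statically allocated N ports, the gateway could
# serve at most total_ports / N nodes before new outbound connections
# (e.g. image pulls during a deploy) start failing.
ports_per_node = 1_024  # assumed value, for illustration only
print(f"max nodes at {ports_per_node} ports/node: {total_ports // ports_per_node}")
```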
- Service(s) affected: Deployments were the trigger, but all services running inside of Kubernetes would be impacted by this. This includes the following: Mailroom, Container Registry, Git SSH/HTTPS traffic, Websockets, PlantUML, and Sidekiq
- Team attribution: Infrastructure
- Time to detection: 60 minutes
- Minutes downtime or degradation: We will remain in a high state of risk until the corrective actions created back in January are addressed.
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Any user that attempted to use a service running on Kubernetes may have been impacted.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Connection Refused or Connection Timeout
- How many customers were affected?
  - Unknown
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Unknown
What were the root causes?
Incident Response Analysis
- How was the incident detected?
  - Deployment Failure
- How could detection time be improved?
  - Monitoring saturation of NAT port consumption
- How was the root cause diagnosed?
  - By validating from a GKE node that it was unable to make requests to the external world; in our case, a docker pull was attempted on the command line (see the sketch after this list).
- How could time to diagnosis be improved?
  - Having monitoring for this would be a great first step
- How did we reach the point where we knew how to mitigate the impact?
  - See above
- How could time to mitigation be improved?
  - See above
- What went well?
  - Collaboration among various team members while working through this as well as many other incidents at the same time.
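The check below is a minimal sketch of the kind of egress validation referenced in the diagnosis answer above: attempting an outbound TCP connection from a node and treating a timeout or refusal as a sign that the egress path (e.g. exhausted Cloud NAT ports) is the problem. The target host is illustrative; during the incident the test was simply a docker pull from the command line.

```python
# Minimal egress probe: try to open an outbound TCP connection from a node.
# A timeout or refusal here, while the target is known-healthy, points at
# the egress path (e.g. exhausted Cloud NAT ports) rather than the target.
import socket

def can_egress(host: str = "registry.hub.docker.com", port: int = 443,
               timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"egress check to {host}:{port} failed: {exc}")
        return False

if __name__ == "__main__":
    print("egress OK" if can_egress() else "egress blocked or timing out")
```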
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - A similar event occurred in our staging environment a few months ago: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12060
  - And then again in January: #3448 (closed)
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Yes
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Indirectly - the code change resulted in the infrastructure needing to create network connections to a service in order to perform the deploy
Lessons Learned
- We learned that we have gaps in our monitoring of Cloud NAT device saturation
- As we advertise the range of IPs that we utilize to our customer base, we've elected not to expand that range until we've had ample time to notify customers ahead of such a change