2021-01-28 Errors deploying to GKE due to inability to pull images

Summary

The current deploy is stuck in ImagePullBackOff loops. It looks like all ports on the Cloud NAT for our GKE cluster are occupied, intermittently blocking image pulls.

Timeline

All times UTC.

2021-01-28

  • 13:22 - First attempted deploy to production failed in Kubernetes
  • 14:05 - Release Manager restarts the job
  • 15:18 - Release Managers note the retry is taking an unusually long time and request assistance from Infrastructure
  • 15:23 - Symptom of deployment failure identified - nodes are unable to pull images from our dev instances' Container Registry
  • 14:30 - Discovery that we've run out of available ports on the Cloud NAT router that services the GKE infrastructure
  • 15:10 - Mitigation via adding additional IP addresses
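The mitigation step above can be sketched with gcloud. Everything here is illustrative: the router, NAT, region, and address names are hypothetical, and the existing IP list is abbreviated.

```shell
# Reserve two additional static external IP addresses
# (all names and the region here are hypothetical)
gcloud compute addresses create nat-ip-15 nat-ip-16 --region=us-east1

# Update the Cloud NAT config to include the new IPs alongside the
# existing ones; --nat-external-ip-pool sets the full pool, so the
# existing addresses must be restated (list abbreviated here)
gcloud compute routers nats update gke-nat \
    --router=gke-router \
    --region=us-east1 \
    --nat-external-ip-pool=nat-ip-1,nat-ip-2,nat-ip-15,nat-ip-16
```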

Corrective Actions

Incident Review

Summary

We use Google's Cloud NAT device to provide a preset, configurable set of IP addresses from which external network traffic can egress, allowing customers to more confidently define which IP blocks are allowed to reach their infrastructure endpoints. This enables us to more securely configure our clusters in GCP as private, such that the nodes do not have public-facing IP addresses. Cloud NAT devices allow a limited number of connections to traverse them, in this case 65,536 ports per IP address. At the time of the incident, we were using 14 IP addresses, giving us a maximum capacity of 917,504 available connections for ANY type of traffic between our GKE infrastructure and the outside world. To mitigate this, we expanded the set of IP addresses used by this Cloud NAT device by 2, raising our maximum capacity to 1,048,576.
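The capacity figures above follow directly from ports-per-IP times IP count; a minimal sketch of that arithmetic:

```python
# Rough check of the Cloud NAT capacity math described above.
PORTS_PER_IP = 65_536  # TCP/UDP source ports available per NAT IP address

def nat_capacity(num_ips: int) -> int:
    """Maximum simultaneous NAT connections for a given number of IPs."""
    return num_ips * PORTS_PER_IP

print(nat_capacity(14))  # capacity at the time of the incident: 917504
print(nat_capacity(16))  # capacity after adding 2 IPs: 1048576
```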

  1. Service(s) affected: Deployments were the trigger, but all services running inside Kubernetes would be impacted by this. This includes the following: Mailroom, Container Registry, Git SSH/HTTPS traffic, Websockets, PlantUML, and Sidekiq
  2. Team attribution: Infrastructure
  3. Time to detection: 60 minutes
  4. Minutes downtime or degradation: 210 minutes degraded, as calculated from the NAT port availability chart

Metrics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Any user who attempted to use a service running on Kubernetes may have been impacted.
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. Connection Refused or Connection Timeout
  3. How many customers were affected?
    1. Unknown
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. Unknown

What were the root causes?

"5 Whys"

Incident Response Analysis

  1. How was the incident detected?
    1. Deployment Failure
  2. How could detection time be improved?
    1. Monitoring saturation of NAT port consumption
  3. How was the root cause diagnosed?
    1. By validating from a GKE node that it was unable to make requests to the outside world; in our case, a docker pull was run on the command line.
  4. How could time to diagnosis be improved?
    1. Having monitoring for this would be a great first step
  5. How did we reach the point where we knew how to mitigate the impact?
    1. See above
  6. How could time to mitigation be improved?
    1. See above
  7. What went well?
    1. Collaboration amongst various team members while working through this as well as many other incidents at the same time.
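The detection improvement identified above, monitoring saturation of NAT port consumption, could take the shape of a simple threshold check on allocated ports versus total capacity. This is a sketch only; the threshold value and function names are hypothetical.

```python
# Hypothetical sketch of an alert condition on Cloud NAT port saturation.
PORTS_PER_IP = 65_536  # TCP/UDP source ports available per NAT IP address

def nat_saturation(allocated_ports: int, num_ips: int) -> float:
    """Fraction of available NAT ports currently allocated."""
    return allocated_ports / (num_ips * PORTS_PER_IP)

def should_alert(allocated_ports: int, num_ips: int,
                 threshold: float = 0.85) -> bool:
    """Fire an alert when saturation crosses the (hypothetical) threshold."""
    return nat_saturation(allocated_ports, num_ips) >= threshold

# Example: 900,000 allocated ports across 14 NAT IPs (~98% saturated)
print(should_alert(900_000, 14))  # True
```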

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. A similar event occurred in our staging environment a few months ago: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12060
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. No
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. No

Lessons Learned

  1. We learned that we have gaps in our monitoring for our Cloud NAT device saturation

Guidelines

Incident Review Stakeholders

  1. @skarbek
  2. @dawsmith