2021-01-28 Errors deploying to GKE due to inability to pull images

Summary

The current deploy is stuck in ImagePullBackOff loops. It looks like all ports on the Cloud NAT for our GKE cluster are occupied, intermittently blocking image pulls.

Timeline

All times UTC.

2021-01-28

  • 13:22 - First attempted deploy to production failed in Kubernetes
  • 14:05 - Release Manager restarts the job
  • 15:18 - Release Managers note the retry is taking an unusually long time and request assistance from Infrastructure
  • 15:23 - Symptom of deployment failure identified - nodes are unable to pull images from our dev instances' Container Registry
  • 14:30 - Discovery that we've run out of available ports on the Cloud NAT router that services the GKE infrastructure
  • 15:10 - Mitigation via adding additional IP addresses
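The mitigation step above can be sketched with gcloud. Everything here is illustrative: the router, NAT, region, and address names are hypothetical, and the existing IP list is abbreviated.

```shell
# Reserve two additional static external IP addresses
# (all names and the region here are hypothetical)
gcloud compute addresses create nat-ip-15 nat-ip-16 --region=us-east1

# Update the Cloud NAT config to include the new IPs alongside the
# existing ones; --nat-external-ip-pool sets the full pool, so the
# existing addresses must be restated (list abbreviated here)
gcloud compute routers nats update gke-nat \
    --router=gke-router \
    --region=us-east1 \
    --nat-external-ip-pool=nat-ip-1,nat-ip-2,nat-ip-15,nat-ip-16
```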

Corrective Actions

Incident Review

Summary

We use Google's Cloud NAT device to provide a preset, configurable set of IP addresses from which external network traffic can egress, allowing customers to more confidently define which IP blocks are allowed to reach their infrastructure endpoints. This enables us to more securely configure our clusters in GCP as private, such that the nodes do not have public-facing IP addresses. Cloud NAT devices allow a limited number of connections to traverse them, in this case 65,536 ports per IP address. At the time of the incident, we were using 14 IP addresses, giving us a maximum capacity of 917,504 available connections for ANY type of traffic between our GKE infrastructure and the outside world. To mitigate this, we expanded the set of IP addresses used by this Cloud NAT device by 2, raising our maximum capacity to 1,048,576.
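The capacity figures above follow directly from ports-per-IP times IP count; a minimal sketch of that arithmetic:

```python
# Rough check of the Cloud NAT capacity math described above.
PORTS_PER_IP = 65_536  # TCP/UDP source ports available per NAT IP address

def nat_capacity(num_ips: int) -> int:
    """Maximum simultaneous NAT connections for a given number of IPs."""
    return num_ips * PORTS_PER_IP

print(nat_capacity(14))  # capacity at the time of the incident: 917504
print(nat_capacity(16))  # capacity after adding 2 IPs: 1048576
```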

  1. Service(s) affected: Deployments were the trigger, but all services running inside Kubernetes would be impacted by this. This includes the following: Mailroom, Container Registry, Git SSH/HTTPS traffic, Websockets, PlantUML, and Sidekiq
  2. Team attribution: Infrastructure
  3. Time to detection: 60 minutes
  4. Minutes downtime or degradation: 210 minutes degraded, as calculated from the NAT port availability chart

Metrics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Any user who attempted to use a service running on Kubernetes may have been impacted.
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. Connection Refused or Connection Timeout
  3. How many customers were affected?
    1. Unknown
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. Unknown

What were the root causes?

"5 Whys"

Incident Response Analysis

  1. How was the incident detected?
    1. Deployment Failure
  2. How could detection time be improved?
    1. Monitoring saturation of NAT port consumption
  3. How was the root cause diagnosed?
    1. By validating from a GKE node that it was unable to make requests to the outside world; in our case, a docker pull was run on the command line.
  4. How could time to diagnosis be improved?
    1. Having monitoring for this would be a great first step
  5. How did we reach the point where we knew how to mitigate the impact?
    1. See above
  6. How could time to mitigation be improved?
    1. See above
  7. What went well?
    1. Collaboration amongst various team members while working through this as well as many other incidents at the same time.
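The detection improvement identified above, monitoring saturation of NAT port consumption, could take the shape of a simple threshold check on allocated ports versus total capacity. This is a sketch only; the threshold value and function names are hypothetical.

```python
# Hypothetical sketch of an alert condition on Cloud NAT port saturation.
PORTS_PER_IP = 65_536  # TCP/UDP source ports available per NAT IP address

def nat_saturation(allocated_ports: int, num_ips: int) -> float:
    """Fraction of available NAT ports currently allocated."""
    return allocated_ports / (num_ips * PORTS_PER_IP)

def should_alert(allocated_ports: int, num_ips: int,
                 threshold: float = 0.85) -> bool:
    """Fire an alert when saturation crosses the (hypothetical) threshold."""
    return nat_saturation(allocated_ports, num_ips) >= threshold

# Example: 900,000 allocated ports across 14 NAT IPs (~98% saturated)
print(should_alert(900_000, 14))  # True
```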

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. A similar event occurred in our staging environment a few months ago: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12060
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. No
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. No

Lessons Learned

  1. We learned that we have gaps in our monitoring for our Cloud NAT device saturation

Guidelines

Incident Review Stakeholders

  1. @skarbek
  2. @dawsmith