2021-03-15 Cloud NAT in gprd is at its limit
Summary
The current deploy is stuck in ImagePullBackOff loops. It looks like all ports on the Cloud NAT for our GKE cluster are occupied, intermittently blocking image pulls.
Timeline
All times UTC.
2021-03-15
- 14:41 - Deployment to Kubernetes Infrastructure times out
- 14:49 - First attempted deploy to production failed in Kubernetes
- 19:24 - Mitigation via modification to deployments to stagger cluster deploys
Corrective Actions
- Modification to our deployments to stagger cluster jobs: #3981 (comment 530865965)
- We do not monitor port availability on our Cloud NAT devices. This led to port exhaustion unknowingly impacting deployments and likely other systems (a possible monitoring sketch follows this list) - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12541
- We do not fully understand the IP allocation and customization capabilities needed to grow any further - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12542
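As a starting point for the monitoring gap called out above, the sketch below polls Cloud Monitoring for Cloud NAT per-VM port usage and flags instances approaching their allocation. It is a minimal illustration only: the project ID, threshold, and assumed per-VM allocation are placeholders, not something we have wired up.

```python
# Minimal sketch: read Cloud NAT per-VM port usage from Cloud Monitoring
# and flag VMs approaching their port allocation. Placeholder project ID
# and threshold; not a production alerting rule.
import time
from google.cloud import monitoring_v3

PROJECT = "projects/my-gcp-project"  # placeholder project
THRESHOLD = 0.8 * 64                 # assumed 64-port minimum allocation per VM

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 300}}
)

series = client.list_time_series(
    request={
        "name": PROJECT,
        "filter": 'metric.type = "router.googleapis.com/nat/port_usage"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    vm = ts.resource.labels.get("instance_id", "unknown")
    latest = ts.points[0].value.int64_value if ts.points else 0
    if latest >= THRESHOLD:
        print(f"VM {vm} is using {latest} NAT ports (threshold {THRESHOLD:.0f})")
```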
Incident Review
Summary
We leverage Google's Cloud NAT to provide a preset, configured set of IP addresses through which external network traffic egresses, allowing customers to more confidently define which IP blocks are allowed to reach into their infrastructure endpoints. This enables us to configure our clusters in GCP as private, such that the nodes do not have public-facing IP addresses. Cloud NAT devices allow a limited number of connections to traverse them, in this case 65,536 ports per IP address. At the time of the incident we were using 14 IP addresses, giving us a maximum capacity of 917,504 available connections for ANY type of traffic between our GKE infrastructure and the outside world. Because we cannot expand our IP address range without notifying customers ahead of time, to ensure that traffic to heavily firewalled customers does not break, we've elected to see if we can stagger deployments.
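As a back-of-the-envelope illustration of the figures above (the per-node port allocation is an assumed value for illustration, not our actual configuration):

```python
# Rough Cloud NAT capacity math for the numbers quoted above.
PORTS_PER_IP = 65_536   # ports available per NAT IP address
NAT_IPS = 14            # NAT IPs in use at the time of the incident

total_ports = PORTS_PER_IP * NAT_IPS
print(f"total NAT ports: {total_ports}")  # 917504

# If each GKE node were statically allocated N ports, the gateway could
# serve at most total_ports / N nodes before new outbound connections
# (e.g. image pulls during a deploy) start failing.
ports_per_node = 1_024  # assumed value, for illustration only
print(f"max nodes at {ports_per_node} ports/node: {total_ports // ports_per_node}")
```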
- Service(s) affected: Deployments were the trigger, but all services running inside of Kubernetes would be impacted by this. This includes the following: Mailroom, Container Registry, Git SSH/HTTPS traffic, Websockets, PlantUML, and Sidekiq
- Team attribution: Infrastructure
- Time to detection: 60 minutes
- Minutes downtime or degradation: We will remain in a high state of risk until the corrective actions created back in January are addressed.
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Any user that attempted to use a service running on Kubernetes may have been impacted.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Connection Refused or Connection Timeout
- How many customers were affected?
  - Unknown
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Unknown
What were the root causes?
Incident Response Analysis
- How was the incident detected?
  - Deployment Failure
- How could detection time be improved?
  - Monitoring saturation of NAT port consumption
- How was the root cause diagnosed?
  - By validating from a GKE node that it was unable to make requests to the external world; in our case, a docker pull was attempted on the command line (see the sketch after this list).
- How could time to diagnosis be improved?
  - Having monitoring for this would be a great first step
- How did we reach the point where we knew how to mitigate the impact?
  - See above
- How could time to mitigation be improved?
  - See above
- What went well?
  - Collaboration among various team members while working through this as well as many other incidents at the same time.
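The check below is a minimal sketch of the kind of egress validation referenced in the diagnosis answer above: attempting an outbound TCP connection from a node and treating a timeout or refusal as a sign that the egress path (e.g. exhausted Cloud NAT ports) is the problem. The target host is illustrative; during the incident the test was simply a docker pull from the command line.

```python
# Minimal egress probe: try to open an outbound TCP connection from a node.
# A timeout or refusal here, while the target is known-healthy, points at
# the egress path (e.g. exhausted Cloud NAT ports) rather than the target.
import socket

def can_egress(host: str = "registry.hub.docker.com", port: int = 443,
               timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"egress check to {host}:{port} failed: {exc}")
        return False

if __name__ == "__main__":
    print("egress OK" if can_egress() else "egress blocked or timing out")
```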
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - A similar event occurred in our staging environment a few months ago: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12060
  - And then again in January: #3448 (closed)
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Yes
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Indirectly - the code change resulted in the infrastructure needing to create network connections to a service in order to perform the deploy
Lessons Learned
- We learned that we have gaps in our monitoring of Cloud NAT device saturation
- As we advertise the range of IPs that we utilize to our customer base, we've elected not to expand that range until we've had ample time to notify customers ahead of such a change