Cloud NAT missing saturation metrics for port usage
In a recent incident, production#3448 (closed), we discovered that we had fully saturated all available ports on the Cloud NAT device that serves as the gateway for all Internet traffic to and from our GKE clusters in gprd. Use this issue to determine how we can improve both our alerting and our dashboards for monitoring the saturation level of these devices.
This is the second time this has occurred. Let's get ahead of the problem before it happens a third time.
Milestones
- Appropriate metrics are put into place to monitor Cloud NAT usage
- Alerts are added when we exceed a threshold to warn us that we need to take action, such as adding an IP address to the Cloud NAT device
- Dashboards are updated with saturation metrics
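
As a rough illustration of the saturation figure these milestones are after, here is a minimal sketch of the arithmetic. It assumes each NAT IP address provides 64,512 usable source ports per protocol (65,536 minus the 1,024 well-known ports). The function names and the 70% / 90% thresholds are hypothetical placeholders, not values we have agreed on, and the allocated port count would have to come from whatever Cloud NAT metric we end up exporting.

```python
import math

# Assumption: usable source ports per NAT IP, per protocol
# (65,536 total minus the 1,024 well-known ports).
PORTS_PER_NAT_IP = 64512


def nat_port_saturation(allocated_ports: int, nat_ip_count: int) -> float:
    """Return the fraction of available NAT ports currently allocated."""
    if nat_ip_count <= 0:
        raise ValueError("nat_ip_count must be positive")
    return allocated_ports / (nat_ip_count * PORTS_PER_NAT_IP)


def alert_level(saturation: float, warn: float = 0.70, page: float = 0.90) -> str:
    """Map a saturation ratio to a hypothetical alert level."""
    if saturation >= page:
        return "page"
    if saturation >= warn:
        return "warn"
    return "ok"


def nat_ips_needed(allocated_ports: int, target_saturation: float = 0.70) -> int:
    """Return how many NAT IPs keep saturation at or below the target."""
    if not 0 < target_saturation <= 1:
        raise ValueError("target_saturation must be in (0, 1]")
    return math.ceil(allocated_ports / (target_saturation * PORTS_PER_NAT_IP))
```

For example, `alert_level(nat_port_saturation(120000, 2))` would flag a gateway with two NAT IPs and 120,000 allocated ports, and `nat_ips_needed(120000)` estimates how many IP addresses would bring it back under the warning threshold.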