Improve stability of Calico networking components
The current theory on what's causing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5165 is that nginx pods are being scheduled and receiving traffic before they're ready, which overloads them.
All the critical GKE-provided components, like typha and kube-dns, have the following in their spec:
```yaml
tolerations:
- key: CriticalAddonsOnly
  operator: Exists
- key: components.gke.io/gke-managed-components
  operator: Exists
```
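For reference, the tolerations a given component currently carries can be checked with something like the following (assuming the standard `calico-typha` deployment name in `kube-system`):

```shell
# Print the tolerations from the typha deployment's pod template
kubectl -n kube-system get deployment calico-typha \
  -o jsonpath='{.spec.template.spec.tolerations}'
```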
Proposed next steps:
- Get the calico networking components to a stable state and see if this improves the issue. If it doesn't, we at least rule this out as the cause.
To do this we have 3 options:
1. Complete epic &393 (closed)
typha is unstable because it gets moved around between nodes in all parts of the cluster. Completing this epic would mean it could not be assigned to any node pool except the default one, which currently only runs nginx and thus shouldn't scale up and down as much. This option takes the longest.
2. Create a tiny node pool in all 4 production clusters just to isolate GKE critical components.
Meaning that for each cluster, we could create a tiny node pool with the taint
```
components.gke.io/gke-managed-components=true:NoSchedule
```
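Creating such a pool could look roughly like the sketch below; the pool name, cluster name, and sizing are placeholders, not decided values:

```shell
# Hypothetical isolated pool for GKE-managed components; all names/sizes are placeholders
gcloud container node-pools create gke-managed-components \
  --cluster=CLUSTER_NAME \
  --num-nodes=1 \
  --machine-type=e2-small \
  --node-taints=components.gke.io/gke-managed-components=true:NoSchedule
```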
And then update the calico-typha deployment to add (note that in a Deployment the nodeSelector sits under the pod template's spec):
```yaml
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: new-tainted-node-pool
```
This would place it in an isolated node pool that should never need to scale up or down. I think if we manually modify the typha deployment object Google won't overwrite it (their automation seems to use kubectl behind the scenes to patch objects, so anything we set should be kept). Long term we should move to taints/tolerations everywhere and not rely on nodeSelector.
This option takes a medium amount of time.
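Put together, the relevant part of the patched calico-typha deployment would look roughly like this (the node pool name is a placeholder):

```yaml
spec:
  template:
    spec:
      # Pin typha to the isolated pool and tolerate its taint
      nodeSelector:
        cloud.google.com/gke-nodepool: new-tainted-node-pool
      tolerations:
      - key: components.gke.io/gke-managed-components
        operator: Exists
```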
3. Put a nodeSelector on calico-typha to try to move it to a stable node pool
This is basically a small part of the option above.
We would open a change request to manually run the following on each cluster in gprd (note the nodeSelector belongs under the pod template's spec, not the Deployment's top-level spec):
```shell
kubectl -n kube-system patch deployment/calico-typha \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"cloud.google.com/gke-nodepool":"default-2"}}}}}'
```
This would then at least try to keep typha running on the default node pool only, which should be more stable. The follow-up would be using the following search to check whether the number of typha connection errors has decreased: https://cloudlogging.app.goo.gl/ygYmdLpcg4otfFD39. Ultimately this might not be enough, however, as typha still seems to get disrupted by node scaling caused by nginx (which also runs on the default node pool).
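To confirm the patch took effect, typha's placement can be checked with something like the following (assuming the `k8s-app=calico-typha` label that GKE's typha pods normally carry):

```shell
# Show which nodes the typha pods landed on after the patch
kubectl -n kube-system get pods -l k8s-app=calico-typha -o wide
```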