Improve stability of Calico networking components
The current theory for what's causing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5165 is that nginx pods are being assigned and receiving traffic before they're ready, which overloads them.
All the critical GKE-provided components, like `typha` and `kube-dns`, have the following in their spec:
```
tolerations:
- key: CriticalAddonsOnly
  operator: Exists
- key: components.gke.io/gke-managed-components
  operator: Exists
```
## Proposed next steps:
1. Get the Calico networking components to a stable state to see if this improves the issue. If it doesn't, we at least rule this out as the cause.
To do this, we have three options:
#### 1. Complete epic https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/393
typha is unstable because it gets moved around between nodes in all parts of the cluster. Completing this epic would mean it could only be assigned to the default node pool, which currently only runs nginx and thus shouldn't scale up and down as much. This option takes the most time.
#### 2. Create a tiny node pool in all 4 production clusters to just isolate GKE critical components.
Meaning that for each cluster, we could create a tiny node pool with the taint
```
components.gke.io/gke-managed-components=true:NoSchedule
```
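As a sketch, creating such a pool could look something like the following (the pool name, machine type, node count, and label are placeholders, not decided values; cluster name and region would come from each of the 4 production clusters):

```
# Hypothetical example: create a tiny, tainted node pool reserved for
# GKE-managed components. All names here are placeholders.
gcloud container node-pools create gke-managed-components \
  --cluster=CLUSTER_NAME \
  --region=REGION \
  --num-nodes=1 \
  --machine-type=e2-small \
  --node-taints=components.gke.io/gke-managed-components=true:NoSchedule
```

Only pods that tolerate the `components.gke.io/gke-managed-components` taint (which typha already does, per the tolerations above) would be able to schedule there.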
And then update the deployment `calico-typha` to add
```
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: new-tainted-node-pool
```
This would place it in an isolated node pool that should never need to scale up or down. I think that if we manually modify the typha deployment object, Google won't overwrite it (their automation seems to use kubectl behind the scenes to patch objects, so anything we change *should* be kept). Long term we should move to taints/tolerations everywhere rather than relying on nodeSelector.
This option takes a medium amount of time.
#### 3. Put a nodeSelector on calico-typha to try to move it to a stable node pool
This is basically a small part of the option above.
We would open a change request to manually run the following on each cluster in gprd:
```
kubectl -n kube-system patch deployment/calico-typha -p '{"spec":{"template":{"spec":{"nodeSelector":{"cloud.google.com/gke-nodepool": "default-2"}}}}}'
```
This would then at least try to keep typha running only on the default node pool, which should be more stable. The follow-up would be to use the following search to see whether the number of typha connection errors has decreased: https://cloudlogging.app.goo.gl/ygYmdLpcg4otfFD39. Ultimately, however, this might not be enough, as typha still seems to get disrupted by node scaling caused by nginx (which also runs on the default node pool).
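Whichever option we pick, we'd want to confirm the pods actually moved. Something like the following should work (the `k8s-app=calico-typha` label is assumed from the standard Calico manifests; worth double-checking against the live deployment):

```
# Wait for the rollout triggered by the patch to finish
kubectl -n kube-system rollout status deployment/calico-typha

# List the typha pods with the node each one landed on, to confirm
# they are all on the intended node pool
kubectl -n kube-system get pods -l k8s-app=calico-typha -o wide
```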