Investigate Calico connectivity issues during deployments
Summary
During a recent deployment, the EOC was alerted with the following alert:
Firing 1 - Containers for the git service, main are unable to start. More than 50% of the deployment's maxSurge setting consists of containers unable to start for reasons other than ContainerCreating.
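To make the alert condition concrete, here is a minimal Python sketch that restates it as the alert text describes it. It counts containers that are waiting for any reason other than ContainerCreating and fires when that count exceeds half of maxSurge. The ContainerStatus type and the alert_should_fire function are hypothetical. This is not the actual alerting rule, which is a Prometheus query, and maxSurge is assumed to have already been resolved to an absolute number.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ContainerStatus:
    """Hypothetical, simplified view of a container's state."""
    name: str
    # Reason reported while the container is waiting (for example
    # "ContainerCreating" or "CrashLoopBackOff"); None when it is not waiting.
    waiting_reason: Optional[str]


def alert_should_fire(statuses: Iterable[ContainerStatus], max_surge: int) -> bool:
    """Sketch of the condition in the alert text (assumption, not the real rule).

    Fires when the number of containers waiting for a reason other than
    ContainerCreating exceeds 50% of the deployment's maxSurge.
    """
    stuck = sum(
        1
        for s in statuses
        if s.waiting_reason is not None and s.waiting_reason != "ContainerCreating"
    )
    return stuck > 0.5 * max_surge
```

For example, with one container in CrashLoopBackOff, one in ContainerCreating, and one running, the sketch fires for a maxSurge of 1 but not for a maxSurge of 2.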
Prometheus showed the following (query):
A quick investigation found the following error in the affected pods (more context in production#6268 (comment 829420838)):
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0bee02cbc1e95d8ed1095632c0948dd8ee0eaae233793754fde926965a218a0c": failed to find plugin "calico" in path [/home/kubernetes/bin]
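The error comes from the CNI plugin lookup performed when the pod sandbox is set up. The runtime searches the configured plugin directories (here only /home/kubernetes/bin) for a binary named after the plugin in the network configuration (here calico). The sketch below roughly mirrors that lookup by returning the first regular file with that name in the search path, and could serve as a quick check on an affected node. find_cni_plugin is a hypothetical helper, not part of any CNI library, and the real lookup may differ in details such as executable file extensions.

```python
import os
from typing import Optional, Sequence


def find_cni_plugin(name: str, search_path: Sequence[str]) -> Optional[str]:
    """Return the first regular file called `name` in `search_path`, or None.

    Hypothetical approximation of how a container runtime locates a CNI
    plugin binary; os.path.isfile follows symlinks, like a stat() call.
    """
    for directory in search_path:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Search path taken from the error message above.
    path = ["/home/kubernetes/bin"]
    found = find_cni_plugin("calico", path)
    if found is None:
        print(f'plugin "calico" not found in {path}')
    else:
        print(f"found {found}")
```

This only checks that the file is present. It says nothing about whether the plugin itself works.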
This issue aims to understand this error and to decide whether any action should be taken.
Related Incident(s)
Originating issue(s): production#6268 (closed)
Desired Outcome/Acceptance criteria
- The observed error is understood.
- Evaluate whether any action can be taken to prevent this from recurring.
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of.
- Give context for what problem this corrective action is trying to prevent from recurring.
- Assign a severity label (this is the highest severity of the related incidents; defaults to 'severity::4').
- Assign a priority (defaults to 'priority::4').