Reduce liveness probes interval
Summary
In production#7714 (closed) we saw a single container hang and serve around 10k 502 errors in under 1 minute, shorter than our liveness probe period, which was enough to trigger the 1h window alert. The readiness probe failed far more often, which removed the pod from the service.
If a container is unresponsive, we should delete it rather than try to deal with it at all. We can set the livenessProbe to a shorter interval, similar to what we've done with the readinessProbe, so that containers that are acting up get rotated and killed much more quickly.
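As a sketch of what a shorter interval could look like, here is the current webservice probe shape with tighter timings. The `periodSeconds`/`timeoutSeconds` values below are illustrative assumptions, not the final tuning:

```yaml
# Illustrative livenessProbe with a shorter interval; exact values TBD.
livenessProbe:
  httpGet:
    path: /-/liveness
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 20
  periodSeconds: 10     # was 60: probe far more frequently
  timeoutSeconds: 5     # was 30: fail fast on a hung container
  successThreshold: 1
  failureThreshold: 3   # pod is killed after ~30s of failures instead of ~3 minutes
```

With these numbers a hung container is restarted in roughly 30 seconds (3 failures x 10s period) instead of up to 3 minutes with the current settings.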
Current LivenessProbe
```shell
$ kubectl -n gitlab get po gitlab-webservice-web-5f68558ff9-blc48 -o jsonpath='{range .spec.containers[*]}{.name}{": \t"}{.livenessProbe}{"\n"}{end}'
webservice: {"failureThreshold":3,"httpGet":{"path":"/-/liveness","port":8080,"scheme":"HTTP"},"initialDelaySeconds":20,"periodSeconds":60,"successThreshold":1,"timeoutSeconds":30}
gitlab-workhorse: {"exec":{"command":["/scripts/healthcheck"]},"failureThreshold":3,"initialDelaySeconds":20,"periodSeconds":60,"successThreshold":1,"timeoutSeconds":30}
```
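Assuming the deployment comes from the GitLab Helm chart, the probe timings can likely be overridden through the webservice deployment values. A hedged sketch (the key paths and values here are assumptions and should be verified against the chart version in use):

```yaml
# Hypothetical values.yaml override for the GitLab Helm chart; verify exact keys.
gitlab:
  webservice:
    deployment:
      livenessProbe:
        initialDelaySeconds: 20
        periodSeconds: 10
        timeoutSeconds: 5
        successThreshold: 1
        failureThreshold: 3
```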
Related Incident(s)
Originating issue(s):
Desired Outcome/Acceptance Criteria
- On-call doesn't get paged when a single pod becomes unresponsive and can't serve traffic, which results in a burst of 5xx errors.
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4')
Results
We no longer see spikes in the error ratio for the load balancer for the web service.
Edited by Steve Xuereb