Reduce liveness probes interval
Summary
In production#7714 (closed) we saw a single container hang and serve around 10k 502 errors in under 1 minute, shorter than our liveness probe period, which was enough to trigger the 1h window alert. The readiness probe failed far more often, which removed the pod from the service.
If a container is unresponsive, we should delete it rather than try to deal with it at all. We can set the livenessProbe to a shorter interval, similar to what we've done with the readinessProbe, so that containers that are acting up get rotated and killed much more quickly.
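As a sketch of what a shorter interval could look like, here is the current webservice probe shape with tighter timings. The `periodSeconds`/`timeoutSeconds` values below are illustrative assumptions, not the final tuning:

```yaml
# Illustrative livenessProbe with a shorter interval; exact values TBD.
livenessProbe:
  httpGet:
    path: /-/liveness
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 20
  periodSeconds: 10     # was 60: probe far more frequently
  timeoutSeconds: 5     # was 30: fail fast on a hung container
  successThreshold: 1
  failureThreshold: 3   # pod is killed after ~30s of failures instead of ~3 minutes
```

With these numbers a hung container is restarted in roughly 30 seconds (3 failures x 10s period) instead of up to 3 minutes with the current settings.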
Current LivenessProbe
```shell
$ kubectl -n gitlab get po gitlab-webservice-web-5f68558ff9-blc48 -o jsonpath='{range .spec.containers[*]}{.name}{": \t"}{.livenessProbe}{"\n"}{end}'
webservice: {"failureThreshold":3,"httpGet":{"path":"/-/liveness","port":8080,"scheme":"HTTP"},"initialDelaySeconds":20,"periodSeconds":60,"successThreshold":1,"timeoutSeconds":30}
gitlab-workhorse: {"exec":{"command":["/scripts/healthcheck"]},"failureThreshold":3,"initialDelaySeconds":20,"periodSeconds":60,"successThreshold":1,"timeoutSeconds":30}
```
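Assuming the deployment comes from the GitLab Helm chart, the probe timings can likely be overridden through the webservice deployment values. A hedged sketch (the key paths and values here are assumptions and should be verified against the chart version in use):

```yaml
# Hypothetical values.yaml override for the GitLab Helm chart; verify exact keys.
gitlab:
  webservice:
    deployment:
      livenessProbe:
        initialDelaySeconds: 20
        periodSeconds: 10
        timeoutSeconds: 5
        successThreshold: 1
        failureThreshold: 3
```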
Related Incident(s)
Originating issue(s):
Desired Outcome/Acceptance Criteria
- On-call doesn't get paged when a single pod becomes unresponsive and can't serve traffic, which results in a burst of 5xx errors.
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4')
Results
We no longer see spikes in the error ratio for the load balancer for the web service.
Edited by Steve Xuereb