fix(gprd): workhorse 502 errors on shutdown
## What does this MR do?

### What

- Increase the frequency of the readiness probe from every 10 seconds to every 2 seconds. We use the default readiness check and don't pass `all`, so only the default check runs. The default check verifies that Puma's pipe is available, so it should be a lightweight call.
- Reduce the failure threshold from 3 to 2.

This reduces the time to detect that a pod is not ready from 30 seconds (10 seconds * 3 failures) to 4 seconds (2 seconds * 2 failures).
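For illustration, the change amounts to tightening two fields on the container's readiness probe. A minimal sketch of the resulting stanza, where only `periodSeconds` and `failureThreshold` reflect this MR and the path, port, and timeout are assumed placeholder values, not the chart's actual settings:

```yaml
# Illustrative readiness probe for the webservice container.
# Only periodSeconds and failureThreshold are changed by this MR;
# the httpGet path/port and timeoutSeconds below are assumptions.
readinessProbe:
  httpGet:
    path: /-/readiness   # default check only; `all` is not passed
    port: 8080
  periodSeconds: 2       # was 10
  failureThreshold: 2    # was 3
  timeoutSeconds: 2
```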
### Why

We see frequent SLO violations from GitLab Pages because it's serving 502 errors, as in gitlab-com/gl-infra/production#6767 (closed), gitlab-com/gl-infra/production#6640 (closed), and gitlab-com/gl-infra/production#6783 (closed). When we investigated where the 502s were coming from, we found they originate in Workhorse, because it can't connect to Puma. Looking at the timelines below, this is because the readiness probe is slow to realize that the `webservice` container is no longer available, so it keeps sending requests to this pod.
Pod `gitlab-cny-webservice-api-69d4c7fd65-vhv67`:
- `2022-04-05 06:03:49.000 UTC`: Readiness probe failed
- `2022-04-05 06:03:51.000 UTC`: `webservice` (127.0.0.1:8080) started shutdown.
- `2022-04-05 06:04:04.526 UTC`: puma shutdown finished.
- `2022-04-05 06:04:04.000 UTC` - `2022-04-05 06:04:46.000 UTC`: workhorse served 502 responses constantly for 42 seconds, for every incoming request apart from `/api/v4/jobs/request`
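The 42-second window in the first timeline follows directly from the timestamps above; a quick check:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps from the timeline for pod gitlab-cny-webservice-api-69d4c7fd65-vhv67
puma_shutdown = datetime.strptime("2022-04-05 06:04:04", FMT)
last_502 = datetime.strptime("2022-04-05 06:04:46", FMT)

# Duration during which workhorse served 502s
print((last_502 - puma_shutdown).total_seconds())  # → 42.0
```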
Pod `gitlab-webservice-api-5df8764d96-xwxh4`:
- [`2022-05-06 11:09:20.000 UTC`](https://log.gprd.gitlab.net/goto/63e1d340-b5a8-11ec-afaf-2bca15dfbf33): Readiness probe failed
- [`2022-05-06 11:38:18.000 UTC`](https://log.gprd.gitlab.net/app/discover#/doc/1d7c16d0-c0fa-11ea-a0f8-0b8742fd907c/pubsub-gke-inf-gprd-000930?id=geqQ_n8BLHWdYfJKpdg7): webservice started shutdown.
- [`2022-05-06 11:38:31.755 UTC`](https://log.gprd.gitlab.net/app/discover#/doc/7092c4e2-4eb5-46f2-8305-a7da2edad090/pubsub-rails-inf-gprd-009459?id=LnOr_n8BYjN92f5x0Omt): puma shutdown finished.
- [`2022-05-06 11:38:31.000 UTC` - `2022-05-06 11:39:13.000 UTC`](https://log.gprd.gitlab.net/goto/39e95d80-b5a1-11ec-b73f-692cc1ae8214): workhorse started serving 502 constantly.
We've tested this on gstg and gprd-cny and saw around 80% fewer 502s per pod:

- gstg: !1686 (merged)
- gprd-cny: !1688 (comment 904538853)
### Monitoring

Dashboards:

- HealthController
- api: Kube Deployment Detail
- api: Kube Container Detail

Logs:

- workhorse badgateway logs

Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497
## Author Check-list

Please read the Contributing document and once you do, complete the following:

- Assign to the correct reviewer per the contributing document
- Apply the correct metadata per the contributing document
- Link to related MRs for applying the changes on other environments
- Link to related Chef changes
- If necessary, link to a Criticality 4 Change Request issue
## Reviewer Check-list

- Reviewed the diff jobs to confirm changes are as expected
- No changes shown in the diffs that are not associated with this MR. This may require a rebase or further investigation.
## Applier Check-list

- Make sure there is no ongoing deployment for the affected envs before merging (see the #announcements Slack channel)