fix(gprd): workhorse 502 errors on shutdown
## What does this MR do?

### What

- Increase the frequency of the readiness probe from every 10 seconds to every 2 seconds. We use the default readiness check and don't pass `all`, so only the default check runs. The default check verifies that Puma's pipe is available, so it should be a lightweight call.
- Reduce the failure threshold from 3 to 2.

This reduces the time to detect that a pod is not ready from 30 seconds (10 seconds * 3 failures) to 4 seconds (2 seconds * 2 failures).
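For illustration, the change amounts to tightening two fields on the container's readiness probe. A minimal sketch of the resulting stanza, where only `periodSeconds` and `failureThreshold` reflect this MR and the path, port, and timeout are assumed placeholder values, not the chart's actual settings:

```yaml
# Illustrative readiness probe for the webservice container.
# Only periodSeconds and failureThreshold are changed by this MR;
# the httpGet path/port and timeoutSeconds below are assumptions.
readinessProbe:
  httpGet:
    path: /-/readiness   # default check only; `all` is not passed
    port: 8080
  periodSeconds: 2       # was 10
  failureThreshold: 2    # was 3
  timeoutSeconds: 2
```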
### Why

We see frequent SLO violations from GitLab Pages because it's serving 502 errors, as in gitlab-com/gl-infra/production#6767 (closed), gitlab-com/gl-infra/production#6640 (closed), and gitlab-com/gl-infra/production#6783 (closed). When we investigated where the 502s were coming from, we found they originate in Workhorse, because it can't connect to Puma. Looking at the timelines below, this is because the readiness probe is slow to realize that the `webservice` container is no longer available, so it keeps sending requests to this pod.
Pod `gitlab-cny-webservice-api-69d4c7fd65-vhv67`:
- `2022-04-05 06:03:49.000 UTC`: Readiness probe failed
- `2022-04-05 06:03:51.000 UTC`: `webservice` (127.0.0.1:8080) started shutdown.
- `2022-04-05 06:04:04.526 UTC`: puma shutdown finished.
- `2022-04-05 06:04:04.000 UTC` - `2022-04-05 06:04:46.000 UTC`: workhorse served 502 responses constantly for 42 seconds, for every incoming request apart from `/api/v4/jobs/request`
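The 42-second window in the first timeline follows directly from the timestamps above; a quick check:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps from the timeline for pod gitlab-cny-webservice-api-69d4c7fd65-vhv67
puma_shutdown = datetime.strptime("2022-04-05 06:04:04", FMT)
last_502 = datetime.strptime("2022-04-05 06:04:46", FMT)

# Duration during which workhorse served 502s
print((last_502 - puma_shutdown).total_seconds())  # → 42.0
```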
Pod `gitlab-webservice-api-5df8764d96-xwxh4`:
- [`2022-05-06 11:09:20.000 UTC`](https://log.gprd.gitlab.net/goto/63e1d340-b5a8-11ec-afaf-2bca15dfbf33): Readiness probe failed
- [`2022-05-06 11:38:18.000 UTC`](https://log.gprd.gitlab.net/app/discover#/doc/1d7c16d0-c0fa-11ea-a0f8-0b8742fd907c/pubsub-gke-inf-gprd-000930?id=geqQ_n8BLHWdYfJKpdg7): webservice started shutdown.
- [`2022-05-06 11:38:31.755 UTC`](https://log.gprd.gitlab.net/app/discover#/doc/7092c4e2-4eb5-46f2-8305-a7da2edad090/pubsub-rails-inf-gprd-009459?id=LnOr_n8BYjN92f5x0Omt): puma shutdown finished.
- [`2022-05-06 11:38:31.000 UTC` - `2022-05-06 11:39:13.000 UTC`](https://log.gprd.gitlab.net/goto/39e95d80-b5a1-11ec-b73f-692cc1ae8214): workhorse started serving 502 constantly.
We've tested this on gstg and gprd-cny and saw around 80% fewer 502s per pod:

- gstg: !1686 (merged)
- gprd-cny: !1688 (comment 904538853)
### Monitoring

Dashboards:

- HealthController
- api: Kube Deployment Detail
- api: Kube Container Detail

Logs:

- workhorse badgateway logs

Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497
## Author Check-list

Please read the Contributing document and once you do, complete the following:

- Assign to the correct reviewer per the contributing document
- Apply the correct metadata per the contributing document
- Link to related MRs for applying the changes on other environments
- Link to related Chef changes
- If necessary, link to a Criticality 4 Change Request issue
## Reviewer Check-list

- Reviewed the diff jobs to confirm changes are as expected
- No changes shown in the diffs that are not associated with this MR. This may require a rebase or further investigation.
## Applier Check-list

- Make sure there is no ongoing deployment for the affected envs before merging (see the #announcements Slack channel)