
fix(gprd): workhorse 502 errors on shutdown

Steve Xuereb requested to merge fix/gprd-workhorse-502 into master

What does this MR do?

What

  • Increase the frequency of the readiness probe from every 10 seconds to every 2 seconds. We use the default check rather than running all checks. The default check only verifies that puma's pipe is available, so it should be a lightweight call.
  • Reduce the failure threshold from 3 to 2.

This reduces the time to detect that a pod is not ready from 30 seconds (10 seconds * 3 failures) to 4 seconds (2 seconds * 2 failures).
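The arithmetic above can be sketched in a few lines of Python. This is a rough estimate only: it assumes the detection time is simply the probe period multiplied by the failure threshold, and it ignores probe timeouts and the time it takes for the pod to be removed from the service endpoints. The function name is hypothetical and is not part of this MR.

```python
def detection_time_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate seconds of consecutive probe failures before a pod is marked not ready."""
    return period_seconds * failure_threshold


# Values before and after this MR, taken from the description above.
before = detection_time_seconds(period_seconds=10, failure_threshold=3)
after = detection_time_seconds(period_seconds=2, failure_threshold=2)

print(f"before: {before}s, after: {after}s")
```

Under these assumptions the estimate drops from 30 seconds to 4 seconds, matching the figures quoted above.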

Why

We see frequent SLO violations from GitLab Pages because it's sending 502 errors, as in gitlab-com/gl-infra/production#6767 (closed), gitlab-com/gl-infra/production#6640 (closed), and gitlab-com/gl-infra/production#6783 (closed). When we investigated where the 502s come from, we found that they come from Workhorse, because it can't connect to puma.

The timelines below (as seen here) show why: the readiness probe is slow to realize that the webservice container is no longer available, so requests keep being sent to the pod.

Pod `gitlab-cny-webservice-api-69d4c7fd65-vhv67`:
- `2022-04-05 06:03:49.000 UTC`: Readiness probe failed
- `2022-04-05 06:03:51.000 UTC`: `webservice` (127.0.0.1:8080) started shutdown.
- `2022-04-05 06:04:04.526 UTC`: puma shutdown finished.
- `2022-04-05 06:04:04.000 UTC` - `2022-04-05 06:04:46.000 UTC`: workhorse started serving 502 constantly. That is 42 seconds of 502 responses for any request that comes in apart from `/api/v4/jobs/request`.

Pod `gitlab-webservice-api-5df8764d96-xwxh4`:
- [`2022-05-06 11:09:20.000 UTC`](https://log.gprd.gitlab.net/goto/63e1d340-b5a8-11ec-afaf-2bca15dfbf33): Readiness probe failed
- [`2022-05-06 11:38:18.000 UTC`](https://log.gprd.gitlab.net/app/discover#/doc/1d7c16d0-c0fa-11ea-a0f8-0b8742fd907c/pubsub-gke-inf-gprd-000930?id=geqQ_n8BLHWdYfJKpdg7) : webservice started shutdown.
- [`2022-05-06 11:38:31.755 UTC`](https://log.gprd.gitlab.net/app/discover#/doc/7092c4e2-4eb5-46f2-8305-a7da2edad090/pubsub-rails-inf-gprd-009459?id=LnOr_n8BYjN92f5x0Omt): puma shutdown finished.
- [`2022-05-06 11:38:31.000 UTC` - `2022-05-06 11:39:13.000 UTC`](https://log.gprd.gitlab.net/goto/39e95d80-b5a1-11ec-b73f-692cc1ae8214): workhorse started serving 502 constantly.
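As a quick check on the two timelines, the length of each 502 window can be computed from the start and end timestamps listed above. This is only an illustrative sketch: the helper name and the dictionary layout are hypothetical, and the `UTC` suffix is dropped from the timestamps to keep the parsing simple.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def window_seconds(start: str, end: str) -> float:
    """Return the length of a time window in seconds."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Start and end of the 502 window per pod, copied from the timelines above.
windows = {
    "gitlab-cny-webservice-api-69d4c7fd65-vhv67": (
        "2022-04-05 06:04:04.000",
        "2022-04-05 06:04:46.000",
    ),
    "gitlab-webservice-api-5df8764d96-xwxh4": (
        "2022-05-06 11:38:31.000",
        "2022-05-06 11:39:13.000",
    ),
}

durations = {pod: window_seconds(start, end) for pod, (start, end) in windows.items()}

for pod, seconds in durations.items():
    print(f"{pod}: {seconds:.0f}s of 502 responses")
```

Both windows come out at 42 seconds.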

We've tested on gstg and gprd-cny and see around 80% fewer 502s per pod.

Monitoring

Dashboards:

reference https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497

Author Check-list

Please read the Contributing document and once you do, complete the following:

  • Assign to the correct reviewer per the contributing document
  • Apply the correct metadata per the contributing document
  • Link to related MRs for applying the changes on other environments
  • Link to related Chef changes
  • If necessary link to a Criticality 4 Change Request issue

Reviewer Check-list

  • Reviewed the diff jobs to confirm changes are as expected
  • No changes shown in the diffs not associated with this MR - This may require a rebase or further investigation

Applier Check-list

  • Make sure there is no ongoing deployment for the affected envs before merging (see #announcements slack channel)
Edited by Steve Xuereb
