Test puma and new /readiness healthcheck probe on staging
Background
We currently use /-/health as a healthcheck endpoint. A new healthcheck endpoint, /readiness, is being introduced that will allow us to remove instances from the LB much more cleanly. This will reduce the downtime caused by sending a usr2 signal to puma during deployments and reconfigures.
Once gitlab-org/gitlab!17960 (merged) is completed we will need to do some validation on staging before switching to the new /readiness probe, which is necessary for puma configurations.
- Execute a test plan on staging that verifies no-downtime reloads with the new load balancer health check.
- We need to make sure that HAProxy has time to mark the node as unhealthy during the blackout period.
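As a rough way to reason about the second point: HAProxy marks a server down after `fall` consecutive failed checks spaced `inter` apart, so the blackout period has to outlast roughly `fall * inter` plus the time a single check may take. The sketch below encodes that arithmetic. The function names and all numbers are hypothetical placeholders, not the staging settings, and the formula is an approximation that ignores options such as `fastinter`.

```python
def worst_case_detection_seconds(inter_s, fall, check_timeout_s=0.0):
    """Approximate upper bound on the time HAProxy needs to mark a node down.

    Assumes the node starts failing right after a successful check, so a full
    `inter` passes before the first failing check, and `fall` consecutive
    failures are required. The last check may take up to `check_timeout_s`.
    """
    return fall * inter_s + check_timeout_s


def blackout_is_sufficient(blackout_s, inter_s, fall, check_timeout_s=0.0):
    """True if the blackout period covers the approximate detection time."""
    return blackout_s >= worst_case_detection_seconds(inter_s, fall, check_timeout_s)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    inter_s, fall, check_timeout_s, blackout_s = 2.0, 3, 1.0, 10.0
    needed = worst_case_detection_seconds(inter_s, fall, check_timeout_s)
    ok = blackout_is_sufficient(blackout_s, inter_s, fall, check_timeout_s)
    print(f"needs ~{needed}s, blackout {blackout_s}s, sufficient: {ok}")
```

If the check reports the blackout as too short, either the blackout period or the HAProxy check interval would need adjusting before relying on the new probe.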
Here are the tasks for validation:
- Enable the web-exporter on a staging node with puma and ensure the blackout period is working properly
- Evaluate the current timeouts for HAProxy health checks, ensure that the current settings are appropriate for marking nodes as failed
- With some artificial load applied to staging, compare /-/health errors to /readiness errors during reload
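One possible way to run the last comparison is to poll a URL through the load balancer while a reload is triggered, once with each health check configured, and count the non-200 results. This is a minimal sketch under that reading of the task; the helper names and the example URL are hypothetical, and it is not the load tool used for the actual test.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code, or the exception class name if the
    connection itself failed (reset, refused, timed out)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 502 or 503.
        return exc.code
    except OSError as exc:
        # URLError, timeouts and connection resets are all OSError subclasses.
        return type(exc).__name__


def poll(url, count, interval_s=0.1, fetch=fetch_status, sleep=time.sleep):
    """Request `url` `count` times and tally the outcomes."""
    results = Counter()
    for _ in range(count):
        results[fetch(url)] += 1
        sleep(interval_s)
    return results


def error_count(results):
    """Number of requests that did not return HTTP 200."""
    return sum(n for status, n in results.items() if status != 200)


# Example (hypothetical URL), run once per health check configuration
# while a reload is in progress:
#   results = poll("https://staging.example.com/some/page", count=600)
#   print(results, error_count(results))
```

Comparing the two tallies gives a simple before/after number for each health check.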
Test summary
- We confirmed errors with the old health check and that there were no errors on a usr2 signal with the new health check in place: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8074#note_233258861
- HAProxy configuration looks ok, the blackout window will be long enough for nodes to be marked unhealthy: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8074#note_233258949
- With the large number of requests to the new /readiness endpoint we did notice one case of it getting overrun: [2019-10-21T15:24:09.657+0000] ERROR Errno::ECONNRESET: Connection reset by peer @ io_fillbuf - fd:16. We will continue to monitor this in canary to see if it is an issue, as the number of requests will be much higher there.