Use the new web-exporter health check endpoint on production

Production Change - Criticality 3 C3

Change Objective	Describe the objective of the change
Change Type	ConfigurationChange | HotFix | DeploymentNewFeature | Operation
Services Impacted	All rails services (api, git, web)
Change Team Members	@jarv
Change Severity	C3
Change Reviewer or tested in staging	TBD
Dry-run output	If the change is done through a script, it is mandatory to have a dry-run capability in the script, run the change in dry-run mode and output the result
Due Date	Date and time in UTC timezone for the execution of the change, if possible add the local timezone of the engineer executing the change
Time tracking	To estimate and record times associated with changes ( including a possible rollback )

Summary

GitLab.com currently uses /-/health as the haproxy health check for the following backends:

websockets
ssh
canary_api
canary_https_git
canary_web
web
api_rate_limit

With the new web_exporter health check endpoint, gitlab-org/omnibus-gitlab!3650 (merged), we can now use /readiness, which supports a blackout period. This ensures that when we reload, nodes flip to unavailable immediately for 10 seconds, allowing haproxy to cleanly remove them from the server pool.

This change enables the /readiness health check on haproxy in advance of enabling puma, since this health check is required before we enable puma. /readiness is supported for both unicorn and puma. One major difference between staging and production is that this endpoint will see many more requests in production because of the larger fleet of haproxy nodes. Some testing was done on staging at up to 300 req/second to increase our confidence that the additional load will not be an issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8074#note_233727425

/readiness is working for all production nodes that have rails running:

$ curl http://web-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}

$ curl http://api-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}

$ curl http://git-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}

$ curl web-cny-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}
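
The manual checks above could be repeated across more hosts with a small helper. This is only a minimal sketch: the function names readiness_status and check_host are hypothetical, and the parsing assumes that the top-level "status" key comes first in the response body, as it does in the output shown above.

```shell
# Extract the top-level "status" value from a /readiness JSON body on stdin.
# Assumption: "status" is the first key, as in the responses shown above.
readiness_status() {
  sed -n 's/^{"status":"\([^"]*\)".*/\1/p'
}

# Query one host on port 8083, print "<host> <status>", and return non-zero
# unless the reported status is "ok". An empty reply is shown as "unreachable".
check_host() {
  host=$1
  status=$(curl -s --max-time 5 "http://${host}:8083/readiness" | readiness_status)
  echo "${host} ${status:-unreachable}"
  [ "$status" = "ok" ]
}
```

Usage, with hostnames taken from the checks above: for h in web-01-sv-gprd.c.gitlab-production.internal git-01-sv-gprd.c.gitlab-production.internal; do check_host "$h" || echo "NOT READY: $h"; done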

Procedure

Stop chef on all frontend haproxy nodes: knife ssh 'roles:gprd-base-lb-fe-common' 'sudo service chef-client stop'
Apply the environment update for prod: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/pipelines/87690
Merge and update the gprd-base-lb-fe-common role for the new exporter: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2015
Run chef on fe-01-lb-gprd.c.gitlab-production.internal: sudo chef-client
Validate the health of all servers in all backends on a single haproxy node, through an SSH tunnel to port 7331: ssh -nNTL 7331:localhost:7331 fe-01-lb-gprd.c.gitlab-production.internal
Enable chef on all frontend haproxy nodes: knife ssh 'roles:gprd-base-lb-fe-common' 'sudo service chef-client start'
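
The validation step could be partly automated once the SSH tunnel is open. The following is a minimal sketch under two assumptions that the procedure does not state: that port 7331 serves the haproxy stats page, and that its CSV export is reachable by appending ;csv to the stats URI. The function name unhealthy_servers is hypothetical. It reads haproxy stats CSV on stdin, locates the status column from the header line, and lists every server whose status does not start with UP.

```shell
# Read haproxy stats CSV on stdin and print "<backend>/<server> <status>"
# for every server row whose status does not start with "UP".
# FRONTEND and BACKEND summary rows are skipped.
unhealthy_servers() {
  awk -F',' '
    NR == 1 {
      # The header line starts with "# pxname,svname,..."
      sub(/^# /, "", $1)
      for (i = 1; i <= NF; i++) if ($i == "status") col = i
      next
    }
    col && $2 != "FRONTEND" && $2 != "BACKEND" && $col !~ /^UP/ {
      print $1 "/" $2 " " $col
    }
  '
}
```

Usage, with the tunnel from the validation step open (the stats path is an assumption): curl -s 'http://localhost:7331/;csv' | unhealthy_servers. No output means every checked server reports UP.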

Rollback

Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2015
Update the gprd-base-lb-fe-common role: knife role from file gprd-base-lb-fe-common.json
Force chef runs on all haproxy servers: knife ssh 'roles:gprd-base-lb-fe-common' 'sudo chef-client'
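
Since the change template asks for a dry-run capability whenever a change is scripted, the two knife rollback steps could be wrapped as follows. This is a minimal sketch: the run and rollback function names and the DRY_RUN variable are hypothetical, and reverting the merge request remains a manual step that must be done first.

```shell
# Run the given command, or only print it when DRY_RUN=1 (the default here).
# Note: arguments are printed space-separated, so the original quoting is lost.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'DRY-RUN:'
    printf ' %s' "$@"
    printf '\n'
  else
    "$@"
  fi
}

# The two knife steps of the rollback, after the merge request is reverted.
rollback() {
  run knife role from file gprd-base-lb-fe-common.json
  run knife ssh 'roles:gprd-base-lb-fe-common' 'sudo chef-client'
}
```

Calling rollback with DRY_RUN unset or set to 1 only prints the commands; setting DRY_RUN=0 executes them.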

Edited Oct 22, 2019 by John Jarvis