
Use the new web-exporter health check endpoint on production

Production Change - Criticality 3 (C3)

  • Change Objective: Describe the objective of the change
  • Change Type: ConfigurationChange | HotFix | Deployment | NewFeature | Operation
  • Services Impacted: All rails services (api, git, web)
  • Change Team Members: @jarv
  • Change Severity: C3
  • Change Reviewer or tested in staging: TBD
  • Dry-run output: If the change is done through a script, it is mandatory to have a dry-run capability in the script; run the change in dry-run mode and output the result
  • Due Date: Date and time in UTC timezone for the execution of the change; if possible, add the local timezone of the engineer executing the change
  • Time tracking: To estimate and record the time associated with the change (including a possible rollback)

Summary

GitLab.com currently uses /-/health as the haproxy health check for the following backends

  • websockets
  • ssh
  • canary_api
  • canary_https_git
  • canary_web
  • web
  • api_rate_limit
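For reference, the check haproxy runs against each backend is driven by an option httpchk line in its configuration. A minimal way to confirm what a frontend node currently uses, assuming the standard /etc/haproxy/haproxy.cfg path (the output shown is illustrative, not copied from production):

$ ssh fe-01-lb-gprd.c.gitlab-production.internal 'sudo grep -B2 "option httpchk" /etc/haproxy/haproxy.cfg'
# Illustrative output before this change (other backend stanzas elided):
#   backend web
#     option httpchk GET /-/health
# After this change the same option would point at GET /readiness instead.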

With the new web_exporter healthcheck endpoint gitlab-org/omnibus-gitlab!3650 (merged) we can now use /readiness, which supports a blackout period. This ensures that when we reload, nodes flip to unavailable immediately for 10 seconds, allowing haproxy to cleanly remove them from the server pool.
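A minimal sketch of how that blackout could be observed on a single rails node, assuming the endpoint returns a non-200 status while the blackout is in effect; the reload command and the exact status code are assumptions rather than details taken from the MR:

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8083/readiness
200
$ sudo gitlab-ctl hup unicorn     # trigger a graceful reload (assumed mechanism)
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8083/readiness
503                               # reported unavailable during the ~10s blackout
$ sleep 15; curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8083/readiness
200                               # ready again once the reload completes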

This change enables the /readiness healthcheck on haproxy in advance of the puma rollout, since this healthcheck is required before we enable puma; /readiness is supported by both unicorn and puma. One major difference between staging and production is that this endpoint will see many more requests because of the larger fleet of haproxy nodes. Some testing was done on staging at up to 300 req/second to increase our confidence that the additional load will not be an issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8074#note_233727425
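For a sense of scale, a rough way to generate that kind of load against a single staging node with ApacheBench might look like the following; the tool, request count, and concurrency are assumptions, and <staging-web-node> is a placeholder, since the actual test setup is only described in the linked issue:

$ ab -n 18000 -c 30 http://<staging-web-node>:8083/readiness
# ~18k requests at concurrency 30, roughly in the 300 req/second range
# depending on response latency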

/readiness is working for all production nodes that have rails running:

$ curl http://web-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}

$ curl http://api-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}

$ curl http://git-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}

$ curl http://web-cny-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}

Procedure

  • Stop chef on all frontend haproxy nodes: knife ssh 'roles:gprd-base-lb-fe-common' 'sudo service chef-client stop'
  • Apply the environment update for production: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/pipelines/87690
  • Merge and update gprd-base-lb-fe-common for the new exporter: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2015
  • Run chef on fe-01-lb-gprd.c.gitlab-production.internal: sudo chef-client
  • Validate the health of all servers in all backends on a single haproxy node via an SSH tunnel to the stats page (see the sketch after this list): ssh -nNTL 7331:localhost:7331 fe-01-lb-gprd.c.gitlab-production.internal
  • Re-enable chef on all frontend haproxy nodes: knife ssh 'roles:gprd-base-lb-fe-common' 'sudo service chef-client start'
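With the tunnel from the validation step open, the haproxy stats endpoint can also be queried in CSV form to confirm that no server is marked DOWN under the new check; the stats URI and lack of authentication are assumptions about how the stats listener is configured on these nodes:

$ curl -s 'http://localhost:7331/;csv' | awk -F, 'NR > 1 && $18 !~ /UP|OPEN|no check/ {print $1, $2, $18}'
# status is the 18th CSV field; no output means every server in every backend is UP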

Rollback
