Use the new web-exporter health check endpoint on production
Production Change - Criticality 3

Change Objective | Describe the objective of the change |
---|---|
Change Type | ConfigurationChange / HotFix / DeploymentNewFeature / Operation |
Services Impacted | All rails services (api, git, web) |
Change Team Members | @jarv |
Change Severity | C3 |
Change Reviewer or tested in staging | TBD |
Dry-run output | If the change is done through a script, it is mandatory to have a dry-run capability in the script, run the change in dry-run mode and output the result |
Due Date | Date and time in UTC timezone for the execution of the change, if possible add the local timezone of the engineer executing the change |
Time tracking | To estimate and record times associated with changes ( including a possible rollback ) |
Summary
GitLab.com currently uses /-/health as the haproxy health check for the following backends:
- websockets
- ssh
- canary_api
- canary_https_git
- canary_web
- web
- api_rate_limit
With the new web_exporter health check endpoint (gitlab-org/omnibus-gitlab!3650, merged) we can now use /readiness, which supports a blackout period. When a node is reloaded, it flips to unavailable immediately and stays unavailable for 10 seconds, allowing haproxy to cleanly remove it from the server pool.
This change enables the /readiness health check on haproxy in advance of enabling puma, since this health check is required before we enable puma. /readiness is supported for both unicorn and puma. One major difference between staging and production is that this endpoint will see many more requests in production because of the larger fleet of haproxy nodes. Testing was done on staging at up to 300 requests/second to increase our confidence that the additional load will not be an issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8074#note_233727425
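The blackout period only helps if haproxy marks a node down before the 10 seconds run out. A rough sanity check is sketched below, assuming hypothetical haproxy `inter` and `fall` settings; the values actually used on the production frontends are not stated in this issue, and the function name is illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: does haproxy take a server out of rotation within the blackout window?
# The worst case is approximated as inter * fall (check interval times the
# number of consecutive failed checks required); check timeouts are ignored.
fits_blackout() {
  local inter_s=$1 fall=$2 blackout_s=$3
  [ $((inter_s * fall)) -le "$blackout_s" ]
}

# Hypothetical settings, NOT taken from the production haproxy config:
# inter = 2s, fall = 3, against the 10 second blackout described above.
if fits_blackout 2 3 10; then
  echo "server is marked down within the blackout period"
else
  echo "blackout period is too short for these check settings"
fi
```

If the check interval or failure threshold on the frontends is larger than assumed here, the same function can be re-run with the real values.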
/readiness is working for all production nodes that have rails running:
$ curl http://web-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}
$ curl http://api-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}
$ curl http://git-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}
$ curl web-cny-01-sv-gprd.c.gitlab-production.internal:8083/readiness
{"status":"ok","web_exporter":[{"status":"ok"}],"unicorn_check":[{"status":"ok"}]}
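The manual checks above could be repeated across more hosts with a small helper. This is a minimal sketch, not part of the change itself: the function names are hypothetical, and the parsing assumes the compact JSON layout shown in the output above, with the top-level status as the first key.

```shell
#!/usr/bin/env bash
# Returns 0 when the top-level "status" of a /readiness response body is "ok".
# Assumes the compact layout seen above; a real JSON parser would be more robust.
readiness_ok() {
  case "$1" in
    '{"status":"ok"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Queries port 8083 on each host given as an argument and prints one line per host.
# A failed or timed-out request is treated the same as a non-ok response.
check_hosts() {
  local host body
  for host in "$@"; do
    body=$(curl -fsS --max-time 5 "http://${host}:8083/readiness") || body=""
    if readiness_ok "$body"; then
      echo "ok   ${host}"
    else
      echo "FAIL ${host}"
    fi
  done
}

# Example invocation (commented out so that sourcing this file makes no requests):
# check_hosts web-01-sv-gprd.c.gitlab-production.internal git-01-sv-gprd.c.gitlab-production.internal
```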
Procedure
- Stop chef on all frontend haproxy nodes:
  knife ssh 'roles:gprd-base-lb-fe-common' 'sudo service chef-client stop'
- Apply environment update for prod: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/pipelines/87690
- Merge and update the gprd-base-lb-fe-common role for the new exporter: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2015
- Run chef on fe-01-lb-gprd.c.gitlab-production.internal:
  sudo chef-client
- Validate the health of all servers in all backends on a single haproxy:
  ssh -nNTL 7331:localhost:7331 fe-01-lb-gprd.c.gitlab-production.internal
- Enable chef on all frontend haproxy nodes:
  knife ssh 'roles:gprd-base-lb-fe-common' 'sudo service chef-client start'
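The validation step in the procedure above can also be scripted against haproxy's CSV stats output instead of reading the stats page by eye. The sketch below is only an illustration: the function name is hypothetical, and it relies on the standard haproxy CSV layout in which field 18 is the server status.

```shell
#!/usr/bin/env bash
# Reads haproxy stats in CSV form on stdin and prints every server whose status
# does not start with "UP" (or "no check"). The header line and the FRONTEND and
# BACKEND summary rows are skipped. Transitional states such as "UP 1/3" count as up.
unhealthy_servers() {
  awk -F, '
    /^#/ { next }
    NF < 18 { next }
    $2 == "FRONTEND" || $2 == "BACKEND" { next }
    $18 !~ /^(UP|no check)/ { print $1 "/" $2 ": " $18 }
  '
}

# Example, with the ssh tunnel from the procedure open. The stats path is an
# ASSUMPTION (stats served at the root of port 7331); this issue does not state it.
# curl -s 'http://localhost:7331/;csv' | unhealthy_servers
```

An empty result means no server is reported as down on that haproxy node.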
Rollback
- Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2015
- Update the gprd-base-lb-fe-common role:
  knife role from file gprd-base-lb-fe-common.json
- Force chef runs on all haproxy servers:
  knife ssh 'roles:gprd-base-lb-fe-common' 'sudo chef-client'