Provision additional API nodes (again)
Production Change - Criticality 2

| Change Objective | Scale out the API fleet again to mitigate the queuing described in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8076 |
|---|---|
| Change Type | Scale out request-serving cattle |
| Services Impacted | API |
| Change Team Members | @craigf |
| Change Severity | C2 |
| Buddy check | A colleague will review the change |
| Tested in staging | No |
| Schedule of the change | 2019-10-10 1200 UTC |
| Duration of the change | Minutes for the change, similar for rollback |
Adds even more API nodes; #1225 (closed) did the same earlier in the week.
Steps
- View the pgbouncer waiting client connections chart and keep an eye on it as the new nodes start serving requests. A sudden buildup of pressure at pgbouncer might be cause to roll back (a sketch of checking this directly on a pgbouncer host follows this list).
- Merge and apply https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1101 to provision the nodes (a generic Terraform workflow sketch follows this list).
- Add the API nodes to haproxy: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1968
- Wait for the nodes to begin serving traffic. This won't happen until they are chef'ed up and healthy, and have been added to the LB pool at the FE HAProxy nodes by a subsequent chef run there (no need to manually run chef). You can observe the flow of requests on-box in /var/log/gitlab/gitlab-rails/api_json.log (see the log-tailing sketch after this list), or less intrusively in Kibana (change the hostname filter as appropriate).
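For the first step, if the Grafana chart is slow or unavailable, the same signal can be read straight from the pgbouncer admin console. This is a minimal sketch, assuming an SSH session on a gprd pgbouncer host, the default admin port 6432, and a `pgbouncer` admin user; the `cl_waiting` column of `SHOW POOLS` is the number of client connections queued waiting for a server connection.

```shell
# Connect to the pgbouncer admin console (port and user are assumptions;
# check the local pgbouncer.ini if they differ) and watch cl_waiting.
watch -n 5 "psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c 'SHOW POOLS;'"
```

A cl_waiting value that keeps climbing across refreshes is the buildup of pressure described above and would be the rollback signal.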
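The provisioning MR is applied with the usual Terraform workflow. The directory layout below (an `environments/gprd` directory inside a checkout of gitlab-com-infrastructure) and the absence of any wrapper script are assumptions; follow whatever apply process the repo's README prescribes.

```shell
# From a checkout of gitlab-com-infrastructure with the MR merged
# (the environment directory name is an assumption):
cd environments/gprd
terraform init
terraform plan -out=scale-api.tfplan   # review: only the new api-* nodes should be added
terraform apply scale-api.tfplan
```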
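For the on-box check in the last step, something like the following can be run on one of the newly provisioned API nodes. The jq field names (`status`, `method`, `path`) are assumptions about the api_json.log schema; inspect a raw line first and adjust.

```shell
# Follow the structured API log and print one line per request served.
sudo tail -f /var/log/gitlab/gitlab-rails/api_json.log \
  | jq -r '[.status, .method, .path] | @tsv'
```

A steady stream of lines here means HAProxy has picked the node up and it is serving traffic.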
Rollback
- Drain the new nodes from the haproxy pool: `<chef-repo>/bin/set-server-state gprd 'api-2[3-6]' drain`
- Wait for requests to finish on those servers (use the logs again; see the sketch after this list). This should eliminate any problems downstream, e.g. at Postgres.
- Revert https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1101 and apply the revert to deprovision the nodes.
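A crude way to confirm a drained node has finished its in-flight requests before reverting the provisioning MR is to check whether its API log is still growing. This is a sketch, assuming the node names api-23 through api-26 from the drain command above.

```shell
# Run on each drained node (api-23 .. api-26). If the log stops growing,
# no new requests are arriving and in-flight work has had time to finish.
before=$(sudo wc -l < /var/log/gitlab/gitlab-rails/api_json.log)
sleep 30
after=$(sudo wc -l < /var/log/gitlab/gitlab-rails/api_json.log)
echo "requests logged in the last 30s: $((after - before))"
```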