macOS autoscaling groups to use more than one availability zone
Currently, both the green and blue deployments for the macOS shards use only the us-east-1a zone.
We have had occurrences where this zone has run out of mac2.metal dedicated hosts. Presumably this could happen with any of the newer models too.
This becomes a problem during a blue/green deployment: the newer deployment cannot scale up appropriately until resources are freed from the older one. If we are at capacity for the zone when the deployment occurs, jobs stay pending for a long time (1hr+).
This could also theoretically be a problem if a group is scaled down because fewer jobs are running. Once the number of hosts is reduced, we may not be able to scale back up to meet demand.
We might consider having the two scaling groups (per deployment) in separate zones, e.g. us-east-1a and us-east-1b. This should spread the capacity requests out better and may allow a new deployment to scale in at least one zone if the other has insufficient hosts.
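To make the reasoning concrete, here is a toy model of the blue/green case. It is only a sketch: the `Zone` class, the `new_deployment_hosts` helper, and the capacity numbers are all hypothetical, and real AWS capacity in a zone is neither fixed nor visible to us like this.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """A single availability zone with a fixed pool of dedicated hosts (assumption)."""
    name: str
    capacity: int       # made-up number of mac2.metal hosts the zone can supply
    allocated: int = 0

    def allocate(self, wanted: int) -> int:
        """Grant as many free hosts as possible, up to `wanted`; return the number granted."""
        granted = min(wanted, self.capacity - self.allocated)
        self.allocated += granted
        return granted


def new_deployment_hosts(old_zone: Zone, new_zone: Zone, old_hosts: int, wanted: int) -> int:
    """Hosts the new deployment obtains while the old one still holds `old_hosts`."""
    old_zone.allocate(old_hosts)
    return new_zone.allocate(wanted)


# Blue and green share one zone: the new deployment only gets what is left over.
shared = Zone("us-east-1a", capacity=10)
same_zone = new_deployment_hosts(shared, shared, old_hosts=8, wanted=8)

# Blue and green in separate zones: the new deployment draws from its own pool.
zone_a = Zone("us-east-1a", capacity=10)
zone_b = Zone("us-east-1b", capacity=10)
split_zones = new_deployment_hosts(zone_a, zone_b, old_hosts=8, wanted=8)

print(same_zone, split_zones)  # prints "2 8"
```

With these made-up numbers, the shared-zone deployment gets 2 of the 8 hosts it asks for, while the split-zone deployment gets all 8. The model ignores the case where the second zone is itself short on hosts.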
Steps / History:
- add config options to separate zones for ASGs 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/9914+s
- use latest working runner version 👉 https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5449
- issues with deployment; see the investigation issue for more info 👉 https://gitlab.com/gitlab-org/ci-cd/shared-runners/infrastructure/-/issues/275
- we are deploying a pre-17.7 runner version (17.4), so we need to set a new value, install_helper_images_package, to false 👉 https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5450+s
- following the AWS ASG error output, setting the ASGs to AZs that should have availability 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/10048+s
- we've confirmed the splitting works and jobs were picked up, but we are still playing whack-a-mole with AZ availability. Comment here with details 👉 #254 (comment 2296883482)