
Continuous scale-out of ASG with idle_count=0

The runner appears to be having trouble scaling the Auto Scaling Group down to 0 instances despite having no jobs to run.

I noticed this just today after updating self-hosted GitLab to v17.0.0, but I can see from the logs that the problem has been occurring for some time and applies to both v0.4.0 and v0.5.0 of "Fleeting Plugin AWS". I conclude it must somehow be related to one of the GitLab updates, as my installation is updated regularly and it certainly worked fine before.

Edit: The problem has probably been occurring since GitLab was updated to 16.11.2. Since then, I have seen incorrect runner activity in the ASG "Activity history" view.

Below is an excerpt from the runner's logs, starting right after a successful run of a test job:

(...)
2024-05-20T08:20:35.290780106Z Submitting job to coordinator...ok                  bytesize=14570 checksum=crc32:65261f50 code=200 job=31146 job-status=success runner=LximBY_cJ update-interval=0s
2024-05-20T08:20:35.292053661Z Removed job from processing list                    builds=0 job=31146 max_builds=12 project=1024 repo_url=https://*****/test-group/test-dind.git time_in_queue_seconds=4
2024-05-20T08:20:36.245872058Z required scaling change                             capacity-info={"InstanceCount":1,"MaxInstanceCount":6,"Acquired":0,"UnavailableCapacity":0,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":2} required=-1 runner=LximBY_cJ subsystem=taskscaler

The instance was not used for 15 minutes, and the runner correctly scaled the ASG in to 0:

2024-05-20T08:35:35.683367979Z instance marked for removal                         instance=i-0feda433fcbea2b32 reason=instance exceeded max idle time runner=LximBY_cJ subsystem=taskscaler
2024-05-20T08:35:36.080715719Z decreasing instances                                amount=1 group=aws/eu-central-1/ASG-GitLab-Executors runner=LximBY_cJ subsystem=taskscaler
2024-05-20T08:35:36.341645081Z instance update                                     group=aws/eu-central-1/ASG-GitLab-Executors id=i-0feda433fcbea2b32 runner=LximBY_cJ state=deleting subsystem=taskscaler

But right after that it started scaling it out again:

2024-05-20T08:35:36.684515880Z required scaling change                             capacity-info={"InstanceCount":0,"MaxInstanceCount":6,"Acquired":0,"UnavailableCapacity":2,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":2} required=1 runner=LximBY_cJ subsystem=taskscaler
2024-05-20T08:35:37.447183223Z increasing instances                                amount=1 group=aws/eu-central-1/ASG-GitLab-Executors runner=LximBY_cJ subsystem=taskscaler
2024-05-20T08:35:37.571549390Z increasing instances response                       group=aws/eu-central-1/ASG-GitLab-Executors num_requested=1 num_successful=1 runner=LximBY_cJ subsystem=taskscaler
2024-05-20T08:35:37.571658394Z increase update                                     group=aws/eu-central-1/ASG-GitLab-Executors pending=1 requesting=0 runner=LximBY_cJ subsystem=taskscaler total_pending=1
2024-05-20T08:35:37.685027567Z required scaling change                             capacity-info={"InstanceCount":1,"MaxInstanceCount":6,"Acquired":0,"UnavailableCapacity":2,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":2} required=0 runner=LximBY_cJ subsystem=taskscaler
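Reading the capacity-info fields, the scale-out seems to follow directly from the UnavailableCapacity=2 that is still attributed to the just-deleted instance: with InstanceCount=0 the taskscaler sees a capacity shortfall and requests a new instance. Below is a minimal Go sketch of that arithmetic as I understand it; the struct and the requiredInstances function are my own reconstruction from the logged fields, not the actual taskscaler code.

package main

import "fmt"

// capacityInfo mirrors the fields printed in the capacity-info log entries above.
// This is an illustrative struct, not the real taskscaler type.
type capacityInfo struct {
	InstanceCount       int
	Acquired            int
	UnavailableCapacity int
	Reserved            int
	IdleCount           int
	CapacityPerInstance int
}

// requiredInstances is a rough reconstruction (an assumption, not taskscaler's
// actual logic): demanded capacity (acquired + reserved + idle target + capacity
// reported as unavailable) is compared against what the current instances provide.
func requiredInstances(c capacityInfo) int {
	demand := c.Acquired + c.Reserved + c.IdleCount + c.UnavailableCapacity
	needed := (demand + c.CapacityPerInstance - 1) / c.CapacityPerInstance // ceil(demand / capacity_per_instance)
	return needed - c.InstanceCount
}

func main() {
	// 08:20:36 - one instance, nothing in use, idle_count=0: required = -1 (scale in).
	fmt.Println(requiredInstances(capacityInfo{InstanceCount: 1, CapacityPerInstance: 2}))
	// 08:35:36 - the instance is already deleted, but UnavailableCapacity is still 2:
	// required = 1, so the runner immediately scales the ASG out again.
	fmt.Println(requiredInstances(capacityInfo{InstanceCount: 0, UnavailableCapacity: 2, CapacityPerInstance: 2}))
}

With this reading, the loop would be explained by the leftover UnavailableCapacity never being released after the idle instance is removed, but that is only my interpretation of the log fields.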

The runner settings are as follows:

concurrent = 12
check_interval = 0
connection_max_age = "15m0s"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-autoscaler-small"
  output_limit = 16777216
  url = "https://****/"
  id = 506
  token = "****"
  token_obtained_at = 2024-05-20T09:14:32Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker-autoscaler"
  environment = ["FF_DISABLE_UMASK_FOR_DOCKER_EXECUTOR=1"]
  [runners.custom_build_dir]
  [runners.cache]
    Type = "s3"
    Shared = true
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      BucketName = "****"
      BucketLocation = "eu-central-1"
      AuthenticationType = "iam"
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "alpine:latest"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = true
    disable_cache = false
    volumes = ["/certs/client", "/cache"]
    pull_policy = ["if-not-present"]
    shm_size = 0
    network_mtu = 0
  [runners.autoscaler]
    plugin = "fleeting-plugin-aws"
    capacity_per_instance = 2
    max_use_count = 20
    max_instances = 6
    [runners.autoscaler.plugin_config]
      name             = "ASG-GitLab-Executors"
    [runners.autoscaler.connector_config]
      username          = "ec2-user"
      use_external_addr = false
    [[runners.autoscaler.policy]]
      idle_count = 0
      idle_time = "15m0s"

Manually scaling the ASG to 0 instances calms the runner down: it doesn't try to scale it out again until it receives a job, which once more drops it into the loop of keeping at least one executor instance up.
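
For reference, the manual scale-in mentioned above can be done from the AWS console or programmatically. The sketch below uses the AWS SDK for Go v2 and assumes the ASG name and region from the config above; it is just an illustration of the manual step, not part of the runner setup.

package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

func main() {
	ctx := context.Background()

	// Uses the default credential chain; the region matches the group seen in the logs.
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("eu-central-1"))
	if err != nil {
		log.Fatal(err)
	}

	client := autoscaling.NewFromConfig(cfg)

	// Set the desired capacity of the executors' ASG to 0 (the manual step described above).
	_, err = client.SetDesiredCapacity(ctx, &autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("ASG-GitLab-Executors"),
		DesiredCapacity:      aws.Int32(0),
	})
	if err != nil {
		log.Fatal(err)
	}
}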
