Continuous scale-out of ASG with idle_count=0
The runner appears to have trouble scaling the Auto Scaling Group down to 0 instances despite having no jobs to run. I only noticed this today, after updating self-hosted GitLab to v17.0.0, but I can see from the logs that the problem has been occurring for some time and applies to both v0.4.0 and v0.5.0 of "Fleeting Plugin AWS". I conclude that it must be somehow related to one of the GitLab updates, as my installation is updated regularly and it certainly worked fine before.
Edit: The problem has probably been occurring since the GitLab version was updated to 16.11.2. Since then, I have been seeing incorrect runner activity on the ASG "Activity history" side.
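For reference, the same "Activity history" can also be pulled outside the console. Below is a minimal sketch using aws-sdk-go-v2 (my own helper, not part of the runner; the region and ASG name are the ones from my setup further down, adjust as needed):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

func main() {
	// Region and ASG name taken from my configuration below.
	cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion("eu-central-1"))
	if err != nil {
		log.Fatal(err)
	}
	client := autoscaling.NewFromConfig(cfg)

	// Same data as the "Activity history" tab of the ASG console.
	out, err := client.DescribeScalingActivities(context.TODO(), &autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String("ASG-GitLab-Executors"),
		MaxRecords:           aws.Int32(20),
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range out.Activities {
		fmt.Printf("%s  %-12s  %s\n",
			aws.ToTime(a.StartTime).Format(time.RFC3339),
			a.StatusCode,
			aws.ToString(a.Description))
	}
}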
Below is an excerpt from the runner's logs, taken right after a successful run of a test job:
(...)
2024-05-20T08:20:35.290780106Z Submitting job to coordinator...ok bytesize=14570 checksum=crc32:65261f50 code=200 job=31146 job-status=success runner=LximBY_cJ update-interval=0s
2024-05-20T08:20:35.292053661Z Removed job from processing list builds=0 job=31146 max_builds=12 project=1024 repo_url=https://*****/test-group/test-dind.git time_in_queue_seconds=4
2024-05-20T08:20:36.245872058Z required scaling change capacity-info={"InstanceCount":1,"MaxInstanceCount":6,"Acquired":0,"UnavailableCapacity":0,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":2} required=-1 runner=LximBY_cJ subsystem=taskscaler
The instance was not used for 15 minutes, and the runner correctly took the action to scale the ASG in to 0:
2024-05-20T08:35:35.683367979Z instance marked for removal instance=i-0feda433fcbea2b32 reason=instance exceeded max idle time runner=LximBY_cJ subsystem=taskscaler
2024-05-20T08:35:36.080715719Z decreasing instances amount=1 group=aws/eu-central-1/ASG-GitLab-Executors runner=LximBY_cJ subsystem=taskscaler
2024-05-20T08:35:36.341645081Z instance update group=aws/eu-central-1/ASG-GitLab-Executors id=i-0feda433fcbea2b32 runner=LximBY_cJ state=deleting subsystem=taskscaler
But right after that it started scaling it out again:
2024-05-20T08:35:36.684515880Z required scaling change capacity-info={"InstanceCount":0,"MaxInstanceCount":6,"Acquired":0,"UnavailableCapacity":2,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":2} required=1 runner=LximBY_cJ subsystem=taskscaler
2024-05-20T08:35:37.447183223Z increasing instances amount=1 group=aws/eu-central-1/ASG-GitLab-Executors runner=LximBY_cJ subsystem=taskscaler
2024-05-20T08:35:37.571549390Z increasing instances response group=aws/eu-central-1/ASG-GitLab-Executors num_requested=1 num_successful=1 runner=LximBY_cJ subsystem=taskscaler
2024-05-20T08:35:37.571658394Z increase update group=aws/eu-central-1/ASG-GitLab-Executors pending=1 requesting=0 runner=LximBY_cJ subsystem=taskscaler total_pending=1
2024-05-20T08:35:37.685027567Z required scaling change capacity-info={"InstanceCount":1,"MaxInstanceCount":6,"Acquired":0,"UnavailableCapacity":2,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":2} required=0 runner=LximBY_cJ subsystem=taskscaler
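To illustrate what the capacity-info numbers seem to imply, here is a back-of-the-envelope model (my own guess, not the actual taskscaler code) in which UnavailableCapacity is treated as capacity that still has to be covered by instances. It reproduces all three required values above, including the flip from -1 to +1 the moment the deleted instance shows up as UnavailableCapacity=2:

package main

import (
	"fmt"
	"math"
)

// capacityInfo mirrors the fields printed in the "required scaling change"
// log lines above. This is NOT the taskscaler implementation, only my guess
// at how the numbers could combine.
type capacityInfo struct {
	InstanceCount       int
	Acquired            int
	UnavailableCapacity int
	IdleCount           int
	CapacityPerInstance int
}

// requiredChange: instances needed to cover acquired jobs, the configured
// idle slots and whatever is reported as unavailable, minus the instances
// that currently exist.
func requiredChange(c capacityInfo) int {
	needed := c.Acquired + c.IdleCount + c.UnavailableCapacity
	instances := int(math.Ceil(float64(needed) / float64(c.CapacityPerInstance)))
	return instances - c.InstanceCount
}

func main() {
	// 08:20:36 - one idle instance, nothing to cover           -> prints -1
	fmt.Println(requiredChange(capacityInfo{InstanceCount: 1, CapacityPerInstance: 2}))
	// 08:35:36 - instance deleted, but UnavailableCapacity=2   -> prints 1
	fmt.Println(requiredChange(capacityInfo{UnavailableCapacity: 2, CapacityPerInstance: 2}))
	// 08:35:37 - replacement instance up, still Unavailable=2  -> prints 0
	fmt.Println(requiredChange(capacityInfo{InstanceCount: 1, UnavailableCapacity: 2, CapacityPerInstance: 2}))
}

If that guess is anywhere near right, the question is why UnavailableCapacity jumps to 2 exactly when the idle instance is removed, instead of that capacity simply disappearing from the calculation.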
The runner settings are as follows:
concurrent = 12
check_interval = 0
connection_max_age = "15m0s"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-autoscaler-small"
  output_limit = 16777216
  url = "https://****/"
  id = 506
  token = "****"
  token_obtained_at = 2024-05-20T09:14:32Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker-autoscaler"
  environment = ["FF_DISABLE_UMASK_FOR_DOCKER_EXECUTOR=1"]
  [runners.custom_build_dir]
  [runners.cache]
    Type = "s3"
    Shared = true
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      BucketName = "****"
      BucketLocation = "eu-central-1"
      AuthenticationType = "iam"
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "alpine:latest"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = true
    disable_cache = false
    volumes = ["/certs/client", "/cache"]
    pull_policy = ["if-not-present"]
    shm_size = 0
    network_mtu = 0
  [runners.autoscaler]
    plugin = "fleeting-plugin-aws"
    capacity_per_instance = 2
    max_use_count = 20
    max_instances = 6
    [runners.autoscaler.plugin_config]
      name = "ASG-GitLab-Executors"
    [runners.autoscaler.connector_config]
      username = "ec2-user"
      use_external_addr = false
    [[runners.autoscaler.policy]]
      idle_count = 0
      idle_time = "15m0s"
Manually scaling the ASG to 0 instances results in the runner calming down - it doesn't try to scale out again until it receives a job, at which point it falls back into the loop of keeping at least one (1) executor instance up.