gitlab-runner 16.11.1 never scales to zero, immediately re-launches instances with AWS fleeting plugin
Summary
After updating from gitlab-runner 16.11.0 to 16.11.1, the runner can no longer scale our AWS Auto Scaling group to zero with the AWS fleeting plugin. After scaling to zero, it immediately launches a replacement instance even though we have idle_count = 0.
Steps to reproduce
- Deploy gitlab-runner 16.11.1 using the AWS fleeting plugin 0.4.0, configured with an idle count of 0 and an idle time of e.g. 5 minutes
- Run any job
- Wait until the configured idle_time elapses
Actual behavior
After the configured idle time, the idle EC2 instance is marked for removal ("instance marked for removal"), and terminated successfully.
Immediately after requesting the instance be terminated, a new instance is launched to replace it with the log messages "required scaling change" / "increasing instances".
If you allow the idle_time to elapse again without running any jobs, the same thing happens: the idle instance is terminated without having run any jobs, and a replacement instance is immediately launched. The runner also logs many "required scaling change" messages while nothing is happening, with "required":0 and "required":-1, though I'm not sure whether this is abnormal.
The effect is that it is no longer possible to scale to zero with the AWS fleeting plugin on gitlab-runner 16.11.1.
Downgrading to 16.11.0 fixes this problem.
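To check whether a given runner is affected, the pattern described above can be looked for in the runner's JSON log. The following is a minimal sketch, not part of gitlab-runner: the function name `find_relaunches` and the 5-second window are assumptions chosen for illustration, and a scale-up caused by a real job arriving right after a scale-down would also match.

```python
import json
from datetime import datetime, timedelta


def find_relaunches(lines, window_seconds=5):
    """Return (decrease_time, increase_time) pairs where a taskscaler
    scale-up follows a scale-down within window_seconds.

    Hypothetical helper for log inspection; the window is an assumption.
    """
    window = timedelta(seconds=window_seconds)
    last_decrease = None
    hits = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict) or entry.get("subsystem") != "taskscaler":
            continue
        msg = entry.get("msg")
        if msg not in ("decreasing instances", "increasing instances"):
            continue
        try:
            ts = datetime.strptime(entry["time"], "%Y-%m-%dT%H:%M:%SZ")
        except (KeyError, ValueError):
            continue  # no usable timestamp on this entry
        if msg == "decreasing instances":
            last_decrease = ts
        elif last_decrease is not None and ts - last_decrease <= window:
            hits.append((last_decrease, ts))
            last_decrease = None
    return hits
```

Feeding it the lines of the runner log (for example `find_relaunches(open("runner.log"))`, where the file name is a placeholder) returns one pair per suspected immediate relaunch.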
Expected behavior
After the configured idle time, the instance should be terminated and no replacement instance should be launched (because idle_count = 0).
Relevant logs and/or screenshots
Runner log after first idle timeout
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":0,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":-1,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:36:31Z"}
{"instance":"i-old-instance-id","level":"info","msg":"instance marked for removal","reason":"instance exceeded max idle time","runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:36:31Z"}
{"amount":1,"group":"aws/my-region/my-asg-name","level":"info","msg":"decreasing instances","runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:36:31Z"}
{"group":"aws/my-region/my-asg-name","id":"i-old-instance-id","level":"info","msg":"instance update","runner":"my-runner-id","state":"deleting","subsystem":"taskscaler","time":"2024-05-12T21:36:32Z"}
{"capacity-info":"{\"InstanceCount\":0,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":1,\"Pending\":0,\"Reserved\":1,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":1,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:36:32Z"}
{"amount":1,"group":"aws/my-region/my-asg-name","level":"info","msg":"increasing instances","runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:36:33Z"}
{"group":"aws/my-region/my-asg-name","level":"info","msg":"increasing instances response","num_requested":1,"num_successful":1,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:36:33Z"}
{"group":"aws/my-region/my-asg-name","level":"info","msg":"increase update","pending":1,"requesting":0,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:36:33Z","total_pending":1}
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":1,\"Pending\":0,\"Reserved\":1,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":0,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:36:33Z"}
Runner log with second idle timeout
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":1,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":0,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:38:14Z"}
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":0,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":-1,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:39:04Z"}
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":1,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":0,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:39:05Z"}
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":0,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":-1,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:39:55Z"}
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":1,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":0,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:39:56Z"}
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":0,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":-1,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:40:46Z"}
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":1,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":0,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:40:47Z"}
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":0,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":-1,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:41:37Z"}
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":1,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":0,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:41:38Z"}
{"capacity-info":"{\"InstanceCount\":1,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":0,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":-1,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:42:28Z"}
{"instance":"i-second-instance-now-idle","level":"info","msg":"instance marked for removal","reason":"instance exceeded max idle time","runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:42:28Z"}
{"amount":1,"group":"aws/my-region/my-asg-name","level":"info","msg":"decreasing instances","runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:42:29Z"}
{"group":"aws/my-region/my-asg-name","id":"i-second-instance-now-idle","level":"info","msg":"instance update","runner":"my-runner-id","state":"deleting","subsystem":"taskscaler","time":"2024-05-12T21:42:29Z"}
{"capacity-info":"{\"InstanceCount\":0,\"MaxInstanceCount\":1000,\"Acquired\":0,\"UnavailableCapacity\":1,\"Pending\":0,\"Reserved\":1,\"IdleCount\":0,\"ScaleFactor\":0,\"ScaleFactorLimit\":0,\"CapacityPerInstance\":1}","level":"info","msg":"required scaling change","required":1,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:42:29Z"}
{"amount":1,"group":"aws/my-region/my-asg-name","level":"info","msg":"increasing instances","runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:42:30Z"}
{"group":"aws/my-region/my-asg-name","level":"info","msg":"increasing instances response","num_requested":1,"num_successful":1,"runner":"my-runner-id","subsystem":"taskscaler","time":"2024-05-12T21:42:30Z"}
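The "capacity-info" field in the entries above is itself a JSON string nested inside each log line, which makes the values hard to compare by eye. The sketch below decodes it into flat rows; the helper name `capacity_rows` and the choice of fields are assumptions made for illustration, not anything provided by gitlab-runner or taskscaler.

```python
import json

# Fields picked from the capacity-info payload shown in the logs above.
FIELDS = ("InstanceCount", "Acquired", "UnavailableCapacity", "Pending", "Reserved")


def capacity_rows(lines):
    """Decode "required scaling change" entries into flat dicts.

    Hypothetical helper for reading the runner log; it does not
    reproduce taskscaler's own calculation.
    """
    rows = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if not isinstance(entry, dict) or entry.get("msg") != "required scaling change":
            continue
        try:
            info = json.loads(entry["capacity-info"])  # nested JSON string
        except (KeyError, TypeError, json.JSONDecodeError):
            continue  # entry without a decodable capacity-info payload
        row = {"time": entry.get("time"), "required": entry.get("required")}
        row.update({name: info.get(name) for name in FIELDS})
        rows.append(row)
    return rows
```

Read this way, the logs above show Reserved alternating between 0 and 1 while Acquired stays at 0, and each relaunch follows an entry with InstanceCount 0, Reserved 1 and "required":1. In every entry shown, "required" equals Reserved minus InstanceCount, though that is only an observation from these logs, not a statement about how taskscaler computes it.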
Environment description
Self-hosted gitlab runner, registered with a subgroup on gitlab.com.
gitlab-runner 16.11.1
fleeting-plugin-aws 0.4.0
config.toml contents
concurrent = 100
check_interval = 0
connection_max_age = "15m0s"
shutdown_timeout = 0
[session_server]
session_timeout = 1800
[[runners]]
name = "autogenerated-runner-name"
limit = 100
output_limit = 102400
url = "https://gitlab.com"
id = 34186331
token = "glrt-token-goes-here"
token_obtained_at = 2024-05-12T21:27:52Z
token_expires_at = 0001-01-01T00:00:00Z
tls-ca-file = "/etc/gitlab-runner/certs/ca.pem"
executor = "docker-autoscaler"
[runners.custom_build_dir]
[runners.cache]
Type = "s3"
MaxUploadedArchiveSize = 0
[runners.cache.s3]
BucketName = "my-bucket-name"
BucketLocation = "my-region"
[runners.cache.gcs]
[runners.cache.azure]
[runners.docker]
tls_verify = false
image = "..."
dns = ["..."]
dns_search = ["..."]
privileged = true
disable_entrypoint_overwrite = false
oom_kill_disable = false
disable_cache = false
volumes = ["..."]
wait_for_services_timeout = 300
shm_size = 0
network_mtu = 0
[runners.autoscaler]
capacity_per_instance = 1
max_use_count = 5
max_instances = 0
plugin = "fleeting-plugin-aws"
[runners.autoscaler.plugin_config]
name = "my-asg-name"
[runners.autoscaler.connector_config]
username = "ec2-user"
password = ""
key_path = ""
use_static_credentials = false
keepalive = "0s"
timeout = "0s"
use_external_addr = false
[[runners.autoscaler.policy]]
idle_count = 0
idle_time = "5m0s"
scale_factor = 0.0
scale_factor_limit = 0
Used GitLab Runner version
public.ecr.aws/gitlab/gitlab-runner:latest (sha256:911b6b49538b6bed2e4fce04688febd78824f6ed42fe4d5746284a914b5e0eff)
Version: 16.11.1
Git revision: 535ced5f
Git branch: 16-11-stable
GO version: go1.21.9
Built: 2024-05-03T15:52:38+0000
OS/Arch: linux/amd64
Possible fixes
The 16.11.1 changelog mentions that fleeting and taskscaler were updated (!4745 (merged)). Something in that update is likely the cause of this change in scaling behavior, possibly the capacity calculation changes in gitlab-org/fleeting/taskscaler!46 (merged), though I don't see anything specific there that would cause immediate scaling like this.