Unexpected idle_count/idle_time behaviour with docker-autoscaler
Summary
Hi folks – I'm evaluating the new docker-autoscaler executor as a replacement for the Docker Machine executor, using fleeting-plugin-openstack to run jobs on OpenStack. For testing purposes, I've set idle_count=1 and idle_time=20m, which I would expect to mean "keep one idle instance around, restarting it every 20 minutes" (purely for the sake of testing whether it's actually able to do that correctly), but I'm seeing odd behaviour:
- I start gitlab-runner, it creates an instance (correct)
- I run a job, a second instance gets started (correct, since idle_count=1 means "keep a spare instance around")
- The job finishes, both instances stick around (correct, since neither were idle for idle_time yet)
- After idle_time, the second instance is destroyed (correct)
- Another instance is immediately created (not correct! the first instance is idle and satisfies the "keep one idle instance around at all times" constraint)
- and from then on, two idle instances are kept around instead of one...
My understanding of the flow of the logs below is that:
- the job finishes
- separately, taskscaler decides that the "second" instance above is idle and should be destroyed:
May 16 09:53:20 gitlab-runner[60308]: instance marked for removal instance=94b38d55-9bf2-48e1-97c7-5e57d5547513 reason=instance exceeded max idle time runner=si2_DV96E subsystem=taskscaler
- it's deleted, but according to fleeting-plugin-openstack, OpenStack still reports it as existing for a moment:
May 16 09:53:21 gitlab-runner[60308]: ERROR: out-of-sync capacity actual=2 actualServers=[94b38d55-9bf2-48e1-97c7-5e57d5547513 4baef4b3-a0ee-4eb1-ac03-580367faae29] actualStates=[ACTIVE ACTIVE] cloud=openstack expected=1 name=fleeting-runners runner=si2_DV96E subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:21.733Z
- note, I've patched fleeting-plugin-openstack to log the instances/states OpenStack reports here (
actualServers
andactualStates
)
- note, I've patched fleeting-plugin-openstack to log the instances/states OpenStack reports here (
- perhaps that confuses taskscaler? because for whatever reason, it then decides a new instance is required:
May 16 09:53:21 gitlab-runner[60308]: scale up n=1 rate-limit-n=1 runner=si2_DV96E subsystem=taskscaler
Without knowing anything about taskscaler's internals (and without a proper specification for the expected behaviour of a fleeting plugin!), I'm struggling to figure out anything further here. Unfortunately I also don't really see what fleeting-plugin-openstack could be doing differently – I guess it could keep track of which instances have been marked for deletion, and not report them from Update() (or report them as StateDeleting) even if OpenStack thinks they're still ACTIVE
? Would that help?
Relevant logs and/or screenshots
gitlab-runner debug logs
May 16 09:53:20 gitlab-runner[60308]: Job succeeded duration_s=1266.781438096 job=1273914 project=5083 runner=si2_DV96E
May 16 09:53:20 gitlab-runner[60308]: Appending trace to coordinator...ok code=202 job=1273914 job-log=0-2557 job-status=running runner=si2_DV96E sent-log=2285-2556 status=202 Accepted update-interval=3s
May 16 09:53:20 gitlab-runner[60308]: Updating job... bytesize=2557 checksum=crc32:7ebce19c job=1273914 runner=si2_DV96E
May 16 09:53:20 gitlab-runner[60308]: Submitting job to coordinator...ok bytesize=2557 checksum=crc32:7ebce19c code=200 job=1273914 job-status= runner=si2_DV96E update-interval=0s
May 16 09:53:20 gitlab-runner[60308]: Removed job from processing list builds=0 job=1273914 max_builds=2 project=5083 repo_url=https://[redacted] time_in_queue_seconds=2
May 16 09:53:20 gitlab-runner[60308]: required scaling change capacity-info={"InstanceCount":2,"MaxInstanceCount":3,"Acquired":0,"UnavailableCapacity":0,"Pending":0,"Reserved":0,"IdleCount":1,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:20 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:20 gitlab-runner[60308]: instance marked for removal instance=94b38d55-9bf2-48e1-97c7-5e57d5547513 reason=instance exceeded max idle time runner=si2_DV96E subsystem=taskscaler
May 16 09:53:21 gitlab-runner[60308]: running removal() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:21 gitlab-runner[60308]: decreasing instances amount=1 group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:21 gitlab-runner[60308]: Instance deletion request successful cloud=openstack id=94b38d55-9bf2-48e1-97c7-5e57d5547513 name=fleeting-runners runner=si2_DV96E subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:21.587Z
May 16 09:53:21 gitlab-runner[60308]: Decrease cloud=openstack instances=[94b38d55-9bf2-48e1-97c7-5e57d5547513] name=fleeting-runners runner=si2_DV96E subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:21.587Z
May 16 09:53:21 gitlab-runner[60308]: running reconcile() group=openstack/openstack/fleeting-runners init=false runner=si2_DV96E subsystem=taskscaler
May 16 09:53:21 gitlab-runner[60308]: ERROR: out-of-sync capacity actual=2 actualServers=[94b38d55-9bf2-48e1-97c7-5e57d5547513 4baef4b3-a0ee-4eb1-ac03-580367faae29] actualStates=[ACTIVE ACTIVE] cloud=openstack expected=1 name=fleeting-runners runner=si2_DV96E subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:21.733Z
May 16 09:53:21 gitlab-runner[60308]: running provision() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:21 gitlab-runner[60308]: required scaling change capacity-info={"InstanceCount":1,"MaxInstanceCount":3,"Acquired":0,"UnavailableCapacity":1,"Pending":0,"Reserved":0,"IdleCount":1,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:21 gitlab-runner[60308]: scale up n=1 rate-limit-n=1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:22 gitlab-runner[60308]: Feeding runners to channel builds=0 max_builds=2
May 16 09:53:22 gitlab-runner[60308]: Feeding runner to channel builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:22 gitlab-runner[60308]: Processing runner builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:22 gitlab-runner[60308]: Acquiring executor from provider builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:22 gitlab-runner[60308]: Acquiring job slot builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:22 gitlab-runner[60308]: Acquiring request slot builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:22 gitlab-runner[60308]: Checking for jobs...no content runner=si2_DV96E status=204 No Content
May 16 09:53:22 gitlab-runner[60308]: running removal() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:22 gitlab-runner[60308]: running reconcile() group=openstack/openstack/fleeting-runners init=false runner=si2_DV96E subsystem=taskscaler
May 16 09:53:22 gitlab-runner[60308]: required scaling change capacity-info={"InstanceCount":2,"MaxInstanceCount":3,"Acquired":0,"UnavailableCapacity":1,"Pending":0,"Reserved":0,"IdleCount":1,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=0 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:22 gitlab-runner[60308]: ERROR: out-of-sync capacity actual=1 actualServers=[4baef4b3-a0ee-4eb1-ac03-580367faae29] actualStates=[ACTIVE] cloud=openstack expected=2 name=fleeting-runners runner=si2_DV96E subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:22.911Z
May 16 09:53:22 gitlab-runner[60308]: instance pruned group=openstack/openstack/fleeting-runners id=94b38d55-9bf2-48e1-97c7-5e57d5547513 lifetime=24m26.668742842s runner=si2_DV96E subsystem=taskscaler
May 16 09:53:22 gitlab-runner[60308]: running provision() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:22 gitlab-runner[60308]: increasing instances amount=1 group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:23 gitlab-runner[60308]: Instance creation request successful cloud=openstack id=1378f1eb-459d-4593-b711-52bba5f22e09 name=fleeting-runners runner=si2_DV96E subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:23.202Z
May 16 09:53:23 gitlab-runner[60308]: Increase cloud=openstack delta=1 name=fleeting-runners pre_instances=1 runner=si2_DV96E subsystem=fleeting-plugin.fleeting-plugin-openstack succeeded=1 timestamp=2024-05-16T09:53:23.203Z
May 16 09:53:23 gitlab-runner[60308]: increasing instances response group=openstack/openstack/fleeting-runners num_requested=1 num_successful=1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:23 gitlab-runner[60308]: increase update group=openstack/openstack/fleeting-runners pending=1 requesting=0 runner=si2_DV96E subsystem=taskscaler total_pending=1
May 16 09:53:23 gitlab-runner[60308]: required scaling change capacity-info={"InstanceCount":2,"MaxInstanceCount":3,"Acquired":0,"UnavailableCapacity":0,"Pending":0,"Reserved":0,"IdleCount":1,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:23 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:24 gitlab-runner[60308]: running removal() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:24 gitlab-runner[60308]: running reconcile() group=openstack/openstack/fleeting-runners init=false runner=si2_DV96E subsystem=taskscaler
May 16 09:53:24 gitlab-runner[60308]: instance discovery cause=requested group=openstack/openstack/fleeting-runners id=1378f1eb-459d-4593-b711-52bba5f22e09 runner=si2_DV96E state=creating subsystem=taskscaler
May 16 09:53:24 gitlab-runner[60308]: running provision() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:24 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:25 gitlab-runner[60308]: running removal() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:25 gitlab-runner[60308]: running reconcile() group=openstack/openstack/fleeting-runners init=false runner=si2_DV96E subsystem=taskscaler
May 16 09:53:25 gitlab-runner[60308]: running provision() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:25 gitlab-runner[60308]: Feeding runners to channel builds=0 max_builds=2
May 16 09:53:25 gitlab-runner[60308]: Feeding runner to channel builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:25 gitlab-runner[60308]: Processing runner builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:25 gitlab-runner[60308]: Acquiring executor from provider builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:25 gitlab-runner[60308]: Acquiring job slot builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:25 gitlab-runner[60308]: Acquiring request slot builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:25 gitlab-runner[60308]: Checking for jobs...no content runner=si2_DV96E status=204 No Content
May 16 09:53:25 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:26 gitlab-runner[60308]: running removal() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:26 gitlab-runner[60308]: running reconcile() group=openstack/openstack/fleeting-runners init=false runner=si2_DV96E subsystem=taskscaler
May 16 09:53:26 gitlab-runner[60308]: Instance boot time not passed and cloud-init not finished boot_time=3e+10 cloud=openstack created=2024-05-16T09:53:23Z name=fleeting-runners runner=si2_DV96E server_id=1378f1eb-459d-4593-b711-52bba5f22e09 subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:26.697Z
May 16 09:53:26 gitlab-runner[60308]: running provision() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:26 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:27 gitlab-runner[60308]: running removal() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:27 gitlab-runner[60308]: running reconcile() group=openstack/openstack/fleeting-runners init=false runner=si2_DV96E subsystem=taskscaler
May 16 09:53:27 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:27 gitlab-runner[60308]: Instance boot time not passed and cloud-init not finished boot_time=3e+10 cloud=openstack created=2024-05-16T09:53:23Z name=fleeting-runners runner=si2_DV96E server_id=1378f1eb-459d-4593-b711-52bba5f22e09 subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:27.874Z
May 16 09:53:27 gitlab-runner[60308]: running provision() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:28 gitlab-runner[60308]: Feeding runners to channel builds=0 max_builds=2
May 16 09:53:28 gitlab-runner[60308]: Feeding runner to channel builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:28 gitlab-runner[60308]: Processing runner builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:28 gitlab-runner[60308]: Acquiring executor from provider builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:28 gitlab-runner[60308]: Acquiring job slot builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:28 gitlab-runner[60308]: Acquiring request slot builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:28 gitlab-runner[60308]: Checking for jobs...no content runner=si2_DV96E status=204 No Content
May 16 09:53:28 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:28 gitlab-runner[60308]: running removal() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:28 gitlab-runner[60308]: running reconcile() group=openstack/openstack/fleeting-runners init=false runner=si2_DV96E subsystem=taskscaler
May 16 09:53:29 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:30 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:31 gitlab-runner[60308]: Instance boot time not passed and cloud-init not finished boot_time=3e+10 cloud=openstack created=2024-05-16T09:53:23Z name=fleeting-runners runner=si2_DV96E server_id=1378f1eb-459d-4593-b711-52bba5f22e09 subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:31.455Z
[...snip til boot_time passes...]
May 16 09:53:51 gitlab-runner[60308]: Instance boot time not passed and cloud-init not finished boot_time=3e+10 cloud=openstack created=2024-05-16T09:53:23Z name=fleeting-runners runner=si2_DV96E server_id=1378f1eb-459d-4593-b711-52bba5f22e09 subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:51.931Z
May 16 09:53:51 gitlab-runner[60308]: running provision() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:52 gitlab-runner[60308]: Feeding runners to channel builds=0 max_builds=2
May 16 09:53:52 gitlab-runner[60308]: Feeding runner to channel builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:52 gitlab-runner[60308]: Processing runner builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:52 gitlab-runner[60308]: Acquiring executor from provider builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:52 gitlab-runner[60308]: Acquiring job slot builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:52 gitlab-runner[60308]: Acquiring request slot builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:52 gitlab-runner[60308]: Checking for jobs...no content runner=si2_DV96E status=204 No Content
May 16 09:53:52 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:52 gitlab-runner[60308]: running removal() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:52 gitlab-runner[60308]: running reconcile() group=openstack/openstack/fleeting-runners init=false runner=si2_DV96E subsystem=taskscaler
May 16 09:53:53 gitlab-runner[60308]: instance update group=openstack/openstack/fleeting-runners id=1378f1eb-459d-4593-b711-52bba5f22e09 runner=si2_DV96E state=running subsystem=taskscaler
May 16 09:53:53 gitlab-runner[60308]: running provision() group=openstack/openstack/fleeting-runners runner=si2_DV96E subsystem=taskscaler
May 16 09:53:53 gitlab-runner[60308]: Use address cloud=openstack ip_address=192.168.252.138 name=fleeting-runners network=cloudforms_network runner=si2_DV96E subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:53.173Z
May 16 09:53:53 gitlab-runner[60308]: SSH test result cloud=openstack name=fleeting-runners out=ok
May 16 09:53:53 gitlab-runner[60308]: runner=si2_DV96E subsystem=fleeting-plugin.fleeting-plugin-openstack timestamp=2024-05-16T09:53:53.596Z
May 16 09:53:53 gitlab-runner[60308]: ready instance=1378f1eb-459d-4593-b711-52bba5f22e09 runner=si2_DV96E subsystem=taskscaler took=524.114091ms
May 16 09:53:53 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:54 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:55 gitlab-runner[60308]: Feeding runners to channel builds=0 max_builds=2
May 16 09:53:55 gitlab-runner[60308]: Feeding runner to channel builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:55 gitlab-runner[60308]: Processing runner builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:55 gitlab-runner[60308]: Acquiring executor from provider builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:55 gitlab-runner[60308]: Acquiring job slot builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:55 gitlab-runner[60308]: Acquiring request slot builds=0 max_builds=2 runner=si2_DV96E
May 16 09:53:55 gitlab-runner[60308]: Checking for jobs...no content runner=si2_DV96E status=204 No Content
May 16 09:53:55 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:56 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
May 16 09:53:57 gitlab-runner[60308]: scale down n=-1 runner=si2_DV96E subsystem=taskscaler
Environment description
config.toml contents
concurrent = 2
check_interval = 0
connection_max_age = "15m0s"
shutdown_timeout = 0
log_level = "debug"
[[runners]]
name = "runner-z"
url = "..."
id = 1850
token = "..."
token_obtained_at = 2024-05-14T13:21:43Z
token_expires_at = 0001-01-01T00:00:00Z
executor = "docker-autoscaler"
[runners.custom_build_dir]
[runners.cache]
MaxUploadedArchiveSize = 0
[runners.cache.s3]
[runners.cache.gcs]
[runners.cache.azure]
[runners.docker]
tls_verify = false
image = "..."
privileged = false
disable_entrypoint_overwrite = false
oom_kill_disable = false
disable_cache = false
volumes = ["/cache"]
shm_size = 0
network_mtu = 1380
[runners.autoscaler]
capacity_per_instance = 1
max_use_count = 5
max_instances = 3
plugin = "fleeting-plugin-openstack"
[runners.autoscaler.plugin_config]
cloud = "openstack"
name = "fleeting-runners"
boot_time = "30s" # wait this long for cloud-init to finish, else mark the machine as ready anyway
[runners.autoscaler.plugin_config.server_spec]
name = "fleeting-runner-%d"
imageRef = "c05f8a19-6f17-4556-84a7-c8243cec43d3"
flavorRef = "2100" # m2.tiny
key_name = "runner-z-key" # SSH keypair
networks = [{ uuid = "d0e9eb87-e136-4216-b6ba-73040d2ffcd1" }] # cloudforms_network
security_groups = ["14f6209a-2ff2-4861-bdb2-43e6e947d302"] # cloudforms_ssh_in
[runners.autoscaler.connector_config]
username = "ubuntu"
key_path = "/etc/gitlab-runner/id_ed25519"
use_static_credentials = true
[[runners.autoscaler.policy]]
idle_count = 1
idle_time = "20m0s"
Used GitLab Runner version
$ gitlab-runner --version
Version: 16.11.1
Git revision: 535ced5f
Git branch: 16-11-stable
GO version: go1.21.9
Built: 2024-05-03T15:52:38+0000
OS/Arch: linux/amd64