Occasional deletion of viable instance
Often Taskscaler will delete an instance that is healthy and capable of running jobs.
This seems to happen after the system has been stable for a while.
Some jobs arrive and when they finish, an instance that did not run a job is removed.
Even in violation of max_use_count=0
and idle_count
.
As soon as the instance is terminated, another is recreated to come back up to idle_count
.
I've reproduced this many times running locally.
And I think it accounts for some churn in production which causes an 18% over usage of dedicated hosts.
In the last 2 days, we've needed a maximum of 27 instances (13 and 14) for each runner manager.
This was determined by looking at ASG activity and the desired scale settings.
However there are 32 dedicated hosts which are not released.
Many are in a pending state (which we don't pay for) meaning their instance has recently been deleted.
Some are ready but unscheduled (wasting money).
Production is configured to use max_use_count=0
(default meaning unlimited) so instances shouldn't be deleted unless we are scaling down after a "24h" idle time.
Here is my config:
listen_address = "127.0.0.1:9090"
concurrent = 15
check_interval = 0
shutdown_timeout = 0
[session_server]
session_timeout = 1800
[[runners]]
name = "grit"
url = "https://gitlab.com/"
id = 0
executor = "instance"
[runners.instance]
allowed_images = ["*"]
[runners.autoscaler]
capacity_per_instance = 2
max_use_count = 0
max_instances = 2
plugin = "fleeting-plugin-aws"
[runners.autoscaler.plugin_config]
config_file = "/home/josephburnett/work/grit/aws-config"
credentials_file = "/home/josephburnett/work/grit/aws-credentials"
name = "main"
profile = "default"
[runners.autoscaler.connector_config]
protocol = "ssh"
username = "ec2-user"
password = ""
key_path = "/home/josephburnett/work/grit/ssh.pem"
use_static_credentials = false
keepalive = "30s"
timeout = "1h0m0s"
use_external_addr = true
[runners.autoscaler.vm_isolation]
enabled = true
nesting_host = "unix:///Users/ec2-user/Library/Application Support/nesting.sock"
image = "macos-13-xcode-14"
[runners.autoscaler.vm_isolation.connector_config]
username = "gitlab"
password = "gitlab"
key_path = ""
use_static_credentials = false
keepalive = "30s"
timeout = "1h0m0s"
use_external_addr = false
[[runners.autoscaler.policy]]
idle_count = 4
idle_time = "600s"
Here are the logs from one repro:
- i-0036adf2251275186 came online and did the jobe
- older, viable i-0adcbedfcfb6ffae9 was shut down afterward
- https://gitlab.com/josephburnett/hello-runner/-/pipelines/984533398
2023-08-28T14:54:03.993-0700 [INFO] ready: instance=i-0adcbedfcfb6ffae9 took=6m36.887813097s
Checking for jobs... received job=4968980820 repo_url=https://gitlab.com/josephburnett/hello-runner.git runner=jF8GVQjYP
Added job to processing list builds=1 job=4968980820 max_builds=15 project=39258790 repo_url=https://gitlab.com/josephburnett/hello-runner.git
Preparing instance... job=4968980820 project=39258790 runner=jF8GVQjYP
Dialing instance external-address=44.195.78.101 instance-id=i-0036adf2251275186 internal-address=10.0.0.227 job=4968980820 project=39258790 runner=jF8GVQjYP use-external-address=true
Appending trace to coordinator...ok code=202 job=4968980820 job-log=0-622 job-status=running runner=jF8GVQjYP sent-log=0-621 status=202 Accepted update-interval=1m0s
Created nesting VM nesting-m3nbo8v9 192.168.64.14 job=4968980820 project=39258790 runner=jF8GVQjYP
Job succeeded duration_s=24.432393413 job=4968980820 project=39258790 runner=jF8GVQjYP
Appending trace to coordinator...ok code=202 job=4968980820 job-log=0-1768 job-status=running runner=jF8GVQjYP sent-log=622-1767 status=202 Accepted update-interval=1m0s
Updating job... bytesize=1768 checksum=crc32:a02b1acd job=4968980820 runner=jF8GVQjYP
Submitting job to coordinator...accepted, but not yet completed bytesize=1768 checksum=crc32:a02b1acd code=202 job=4968980820 job-status= runner=jF8GVQjYP update-interval=1s
Updating job... bytesize=1768 checksum=crc32:a02b1acd job=4968980820 runner=jF8GVQjYP
Submitting job to coordinator...ok bytesize=1768 checksum=crc32:a02b1acd code=200 job=4968980820 job-status= runner=jF8GVQjYP update-interval=0s
Removed job from processing list builds=0 job=4968980820 max_builds=15 project=39258790 repo_url=https://gitlab.com/josephburnett/hello-runner.git
2023-08-28T14:57:22.052-0700 [INFO] decreasing instances: amount=1 group=aws/us-east-1/main
2023-08-28T14:57:22.381-0700 [INFO] instance update: group=aws/us-east-1/main id=i-0adcbedfcfb6ffae9 state=deleting
Here is another repo which terminated both viable instances:
2023-08-28T11:15:02.325-0700 [INFO] plugin initialized: version=dev build info="sha=HEAD; ref=HEAD; go=go1.20.1; built_at=now; os_arch=linux/amd64"
2023-08-28T11:15:02.482-0700 [INFO] instance discovery: group=aws/us-east-1/main id=i-0e7d658a51b42a4f3 state=running cause=preexisted
2023-08-28T11:15:03.682-0700 [INFO] increasing instances: amount=1 group=aws/us-east-1/main
2023-08-28T11:15:03.781-0700 [INFO] ready: instance=i-0e7d658a51b42a4f3 took=1.298068459s
2023-08-28T11:15:03.932-0700 [INFO] increasing instances response: group=aws/us-east-1/main num_requested=1 num_successful=1
2023-08-28T11:15:03.932-0700 [INFO] increase update: group=aws/us-east-1/main pending=1 total_pending=1 requesting=0
2023-08-28T11:15:13.949-0700 [INFO] instance discovery: group=aws/us-east-1/main id=i-0aed2f21758963cd7 state=creating cause=requested
2023-08-28T11:22:06.351-0700 [INFO] ready: instance=i-0aed2f21758963cd7 took=6m35.667653673s
Checking for jobs... received job=4967909881 repo_url=https://gitlab.com/josephburnett/hello-runner.git runner=jF8GVQjYP
Added job to processing list builds=1 job=4967909881 max_builds=15 project=39258790 repo_url=https://gitlab.com/josephburnett/hello-runner.git
Preparing instance... job=4967909881 project=39258790 runner=jF8GVQjYP
Dialing instance external-address=44.200.237.222 instance-id=i-0e7d658a51b42a4f3 internal-address=10.0.0.151 job=4967909881 project=39258790 runner=jF8GVQjYP use-external-address=true
Appending trace to coordinator...ok code=202 job=4967909881 job-log=0-622 job-status=running runner=jF8GVQjYP sent-log=0-621 status=202 Accepted update-interval=1m0s
Updating job... bytesize=622 checksum=crc32:25880a4c job=4967909881 runner=jF8GVQjYP
Submitting job to coordinator...ok bytesize=622 checksum=crc32:25880a4c code=200 job=4967909881 job-status= runner=jF8GVQjYP update-interval=0s
Updating job... bytesize=622 checksum=crc32:25880a4c job=4967909881 runner=jF8GVQjYP
Submitting job to coordinator...ok bytesize=622 checksum=crc32:25880a4c code=200 job=4967909881 job-status= runner=jF8GVQjYP update-interval=0s
Updating job... bytesize=622 checksum=crc32:25880a4c job=4967909881 runner=jF8GVQjYP
Submitting job to coordinator...ok bytesize=622 checksum=crc32:25880a4c code=200 job=4967909881 job-status= runner=jF8GVQjYP update-interval=0s
Created nesting VM nesting-fqyzi4cj 192.168.64.2 job=4967909881 project=39258790 runner=jF8GVQjYP
Appending trace to coordinator...ok code=202 job=4967909881 job-log=0-814 job-status=running runner=jF8GVQjYP sent-log=622-813 status=202 Accepted update-interval=1m0s
Job succeeded duration_s=288.594536158 job=4967909881 project=39258790 runner=jF8GVQjYP
Appending trace to coordinator...ok code=202 job=4967909881 job-log=0-1767 job-status=running runner=jF8GVQjYP sent-log=814-1766 status=202 Accepted update-interval=1m0s
Updating job... bytesize=1767 checksum=crc32:3cf14dcf job=4967909881 runner=jF8GVQjYP
Submitting job to coordinator...accepted, but not yet completed bytesize=1767 checksum=crc32:3cf14dcf code=202 job=4967909881 job-status= runner=jF8GVQjYP update-interval=1s
Updating job... bytesize=1767 checksum=crc32:3cf14dcf job=4967909881 runner=jF8GVQjYP
Submitting job to coordinator...ok bytesize=1767 checksum=crc32:3cf14dcf code=200 job=4967909881 job-status= runner=jF8GVQjYP update-interval=0s
Removed job from processing list builds=0 job=4967909881 max_builds=15 project=39258790 repo_url=https://gitlab.com/josephburnett/hello-runner.git
2023-08-28T11:29:31.990-0700 [INFO] decreasing instances: amount=1 group=aws/us-east-1/main
2023-08-28T11:29:32.307-0700 [INFO] instance update: group=aws/us-east-1/main id=i-0aed2f21758963cd7 state=deleting <<<<<<======= WHY? ========
Checking for jobs... received job=4967939137 repo_url=https://gitlab.com/josephburnett/hello-runner.git runner=jF8GVQjYP
Added job to processing list builds=1 job=4967939137 max_builds=15 project=39258790 repo_url=https://gitlab.com/josephburnett/hello-runner.git
Preparing instance... job=4967939137 project=39258790 runner=jF8GVQjYP
Checking for jobs... received job=4967939139 repo_url=https://gitlab.com/josephburnett/hello-runner.git runner=jF8GVQjYP
Added job to processing list builds=2 job=4967939139 max_builds=15 project=39258790 repo_url=https://gitlab.com/josephburnett/hello-runner.git
Preparing instance... job=4967939139 project=39258790 runner=jF8GVQjYP
Dialing instance external-address=44.200.237.222 instance-id=i-0e7d658a51b42a4f3 internal-address=10.0.0.151 job=4967939137 project=39258790 runner=jF8GVQjYP use-external-address=true
Dialing instance external-address=44.200.237.222 instance-id=i-0e7d658a51b42a4f3 internal-address=10.0.0.151 job=4967939139 project=39258790 runner=jF8GVQjYP use-external-address=true
Appending trace to coordinator...ok code=202 job=4967939137 job-log=0-622 job-status=running runner=jF8GVQjYP sent-log=0-621 status=202 Accepted update-interval=1m0s
Appending trace to coordinator...ok code=202 job=4967939139 job-log=0-622 job-status=running runner=jF8GVQjYP sent-log=0-621 status=202 Accepted update-interval=1m0s
Created nesting VM nesting-rln9z9wa 192.168.64.3 job=4967939139 project=39258790 runner=jF8GVQjYP
Created nesting VM nesting-vniwss6p 192.168.64.4 job=4967939137 project=39258790 runner=jF8GVQjYP
Job succeeded duration_s=28.901924288 job=4967939137 project=39258790 runner=jF8GVQjYP
Appending trace to coordinator...ok code=202 job=4967939137 job-log=0-1767 job-status=running runner=jF8GVQjYP sent-log=622-1766 status=202 Accepted update-interval=1m0s
Updating job... bytesize=1767 checksum=crc32:cc44fd55 job=4967939137 runner=jF8GVQjYP
Job succeeded duration_s=28.805082816 job=4967939139 project=39258790 runner=jF8GVQjYP
Submitting job to coordinator...accepted, but not yet completed bytesize=1767 checksum=crc32:cc44fd55 code=202 job=4967939137 job-status= runner=jF8GVQjYP update-interval=1s
Appending trace to coordinator...ok code=202 job=4967939139 job-log=0-1767 job-status=running runner=jF8GVQjYP sent-log=622-1766 status=202 Accepted update-interval=1m0s
Updating job... bytesize=1767 checksum=crc32:0387d201 job=4967939139 runner=jF8GVQjYP
Submitting job to coordinator...accepted, but not yet completed bytesize=1767 checksum=crc32:0387d201 code=202 job=4967939139 job-status= runner=jF8GVQjYP update-interval=1s
Updating job... bytesize=1767 checksum=crc32:cc44fd55 job=4967939137 runner=jF8GVQjYP
Updating job... bytesize=1767 checksum=crc32:0387d201 job=4967939139 runner=jF8GVQjYP
Submitting job to coordinator...ok bytesize=1767 checksum=crc32:cc44fd55 code=200 job=4967939137 job-status= runner=jF8GVQjYP update-interval=0s
Removed job from processing list builds=1 job=4967939137 max_builds=15 project=39258790 repo_url=https://gitlab.com/josephburnett/hello-runner.git
Submitting job to coordinator...ok bytesize=1767 checksum=crc32:0387d201 code=200 job=4967939139 job-status= runner=jF8GVQjYP update-interval=0s
Removed job from processing list builds=0 job=4967939139 max_builds=15 project=39258790 repo_url=https://gitlab.com/josephburnett/hello-runner.git
2023-08-28T11:31:58.928-0700 [INFO] decreasing instances: amount=1 group=aws/us-east-1/main
2023-08-28T11:31:59.272-0700 [INFO] instance update: group=aws/us-east-1/main id=i-0e7d658a51b42a4f3 state=deleting <<<<<<======= WHY? ========
Discovered while testing Mac Runner failure scenarios: gitlab-org/ci-cd/shared-runners/infrastructure#41 (closed)