Docker Autoscaler GCP - race condition with instance group manager autorepair and preempted instances
Summary
When GCP preempts an instance that is managed by both the instance group manager and the GitLab Runner manager, there is a race condition over which of the two creates the replacement instance first.
If the instance group manager's autorepair wins, the instance metadata is missing the SSH key needed to execute the build on the runner, which leads to this error:
Running with gitlab-runner 16.3.1 (f5dfa4d1)
on *** tb8Hp4yH, system ID: ***
Resolving secrets
00:00
Preparing the "docker-autoscaler" executor
00:39
Dialing instance https://www.googleapis.com/compute/v1/projects/**/zones/europe-west1-b/instances/***...
ERROR: Failed to remove network for build
ERROR: Preparation failed: preparing environment: dial ssh: dial tcp ***:22: i/o timeout
Will be retried in 3s ...
Dialing instance https://www.googleapis.com/compute/v1/projects/**/zones/europe-west1-b/instances/***...
ERROR: Failed to remove network for build
ERROR: Preparation failed: preparing environment: dial ssh: dial tcp ***:22: i/o timeout
Will be retried in 3s ...
Dialing instance https://www.googleapis.com/compute/v1/projects/**/zones/europe-west1-b/instances/***...
ERROR: Failed to remove network for build
ERROR: Preparation failed: preparing environment: dial ssh: dial tcp ***:22: i/o timeout
Will be retried in 3s ...
ERROR: Job failed (system failure): preparing environment: dial ssh: dial tcp ***:22: i/o timeout
A workaround is to add retry to .gitlab-ci.yml and to set the following in config.toml:
[runners.autoscaler]
capacity_per_instance = 1
max_use_count = 1
However, I would like to increase max_use_count so that instances are reused more often and already have the pulled Docker images on disk. Unfortunately, this would also require increasing the retry: max: value, leading to a lot of failed retries.
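For reference, the retry part of this workaround could look roughly like the sketch below. The job name `build` and its script are hypothetical placeholders. `retry:max` and `retry:when` are standard GitLab CI keywords, and `runner_system_failure` is assumed to be the matching failure type because the log above ends with "Job failed (system failure)".

```yaml
# Hypothetical job; only the retry block is relevant to the workaround.
build:
  script:
    - make build                 # placeholder command
  retry:
    max: 2                       # GitLab CI accepts 0, 1, or 2
    when: runner_system_failure  # retry only on runner system failures
```

Restricting `when` to system failures keeps genuine build failures from being retried along with the SSH timeouts.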
Steps to reproduce
Stop an instance via gcloud and wait until the instance group manager autorepairs it. The repaired instance is missing the SSH key in its metadata, which results in a failed job.
Environment description
config.toml contents
[runners.autoscaler]
capacity_per_instance = 1
max_use_count = 1
max_instances = 10
plugin = "fleeting-plugin-googlecompute"
[runners.autoscaler.plugin_config]
name = "***"
project = "**"
zone = "europe-west1-b"
[runners.autoscaler.connector_config]
protocol = "ssh"
username = "runner"
password = ""
key_path = ""
use_static_credentials = false
keepalive = "30s"
timeout = "10s"
use_external_addr = false
Used GitLab Runner version
Version: 16.3.1
Git revision: f5dfa4d1
Git branch: 16-3-stable
GO version: go1.20.5
Built: 2023-09-14T14:22:37+0000
OS/Arch: linux/amd64