"no data on pre-existing instance so removing for safety" after pod replacement
We are running gitlab-runner in Kubernetes via the Helm chart, using the docker-autoscaler executor. When the runner pods are replaced, whether by a deployment or a disruption, CI jobs hang and eventually time out. On closer inspection, the EC2 instances are removed, the ASG is scaled to 0, and the job pods are killed, but the jobs still appear as running in GitLab until they time out.
```
2024-10-10T20:48:51.17466353Z stderr F {"err":"no data on pre-existing instance so removing for safety","instance":"i-07e72503a6756951f","level":"error","msg":"ready up preparation failed","runner":"qMC6xCsn1","subsystem":"taskscaler","time":"2024-10-10T20:48:51Z","took":322101332}
```
It seems the RollingUpdate deployment strategy is not compatible with the fleeting plugin: during a rollout, two runner pods are briefly active and fight over the same ASG and EC2 instances, each unaware of the other. The CPU usage chart below shows the period where the two pods overlap.
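If the overlap during rollouts is indeed the trigger, one mitigation is to run the runner Deployment with the Recreate strategy, so the outgoing pod is fully terminated before its replacement starts and only one taskscaler manages the ASG at a time. Whether the gitlab-runner Helm chart exposes a value for this depends on the chart version, so here is a minimal sketch that patches the rendered Deployment directly instead; the Deployment name is an assumption, and note that Recreate only helps with rollouts, not unplanned disruptions.

```yaml
# Sketch: a strategic-merge patch (applied e.g. via kustomize or
# `kubectl patch --patch-file`) that switches the runner Deployment from
# RollingUpdate to Recreate, so old and new runner pods never run
# concurrently against the same ASG.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-runner   # assumption: adjust to your Helm release's Deployment name
spec:
  strategy:
    type: Recreate
```

The trade-off is a short gap in runner availability during each deploy, which is generally acceptable for a CI runner since GitLab queues pending jobs until a runner picks them up again.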