docker-autoscaler and docker+machine fail to fetch incrementally

Summary

Since gitlab-runner 17.10, we consistently hit failures with the git fetch strategy. 17.9.0 works. I have seen this issue on every version from 17.9.3 through 18.1. I haven't tested 17.9.1 or 17.9.2.

fatal: missing blob object 'XXXX'
error: remote did not send all necessary objects

The issue persists with the retries feature flag enabled, and it occurs across two different projects.

The signature is at least similar to #28945 (closed), but that issue emerged in a different timeframe.

Steps to reproduce

The most concrete reproduction steps I have found so far:

1. Have a runner node fetch a merged-result SHA from a merged-result MR (this passes).
2. Have the same node fetch a merged-result SHA from a second merged-result MR. This fetch hits the fatal error above.

This may happen specifically when the two MRs have different CI_MERGE_REQUEST_TARGET_BRANCH_SHA values, but I can't say for sure despite my best efforts.
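The steps above can be approximated outside of CI with a small script. This is only a sketch under several assumptions that are not in the report: that the MR merge ref `refs/merge-requests/<iid>/merge` is a reasonable stand-in for the merged-result SHA, that a shallow fetch with depth 20 resembles what the runner does, and that the working copy path and MR IIDs passed in are your own. The helper names are hypothetical, and the runner's real refspecs may differ.

```python
import subprocess
from typing import List, Tuple


def merge_ref(mr_iid: int) -> str:
    # Assumption: the MR's merge ref stands in for the merged-result SHA.
    return f"refs/merge-requests/{mr_iid}/merge"


def fetch_command(refspec: str, remote: str = "origin", depth: int = 20) -> List[str]:
    # Shallow, incremental fetch of a single ref into an existing checkout.
    # The depth of 20 is an assumed value, not taken from the report.
    return ["git", "fetch", "--depth", str(depth), "--prune", "--quiet",
            remote, f"+{refspec}:{refspec}"]


def fetch_sequentially(workdir: str, mr_iids: List[int]) -> List[Tuple[int, int, str]]:
    """Fetch each MR's merge ref in the same working copy, in order.

    Returns (mr_iid, git exit code, git stderr) for every fetch so that a
    failure on the second MR can be compared against the first.
    """
    results = []
    for iid in mr_iids:
        proc = subprocess.run(fetch_command(merge_ref(iid)), cwd=workdir,
                              capture_output=True, text=True)
        results.append((iid, proc.returncode, proc.stderr.strip()))
    return results
```

Calling `fetch_sequentially("/path/to/existing/checkout", [first_iid, second_iid])` on a node that already holds the first MR's objects would show whether the second fetch returns a non-zero exit code with the `missing blob object` message in stderr.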

Actual behavior

GitLab Runner randomly fails to fetch.

Expected behavior

GitLab Runner completes the fetch with all necessary objects.

Relevant logs and/or screenshots

job log

fatal: missing blob object 'XXXX'
error: remote did not send all necessary objects
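To gauge how often jobs hit this signature, saved job logs can be scanned for the two lines above. The following is a minimal sketch, not part of the runner: the function name is hypothetical, and matching object types other than `blob` is an assumption, since only `blob` appears in this report.

```python
import re
from typing import Iterable, List, Tuple

# Patterns copied from the two log lines in this report. Only "blob" was
# observed; matching other object types is an assumption.
MISSING_OBJECT = re.compile(r"fatal: missing (\w+) object '([^']+)'")
REMOTE_INCOMPLETE = "error: remote did not send all necessary objects"


def find_fetch_failures(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (object_type, object_id) for each missing-object line that is
    directly followed by the 'remote did not send' error."""
    failures = []
    pending = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between log entries
        match = MISSING_OBJECT.search(line)
        if match:
            pending = (match.group(1), match.group(2))
        elif pending and REMOTE_INCOMPLETE in line:
            failures.append(pending)
            pending = None
        else:
            pending = None  # an unrelated line breaks the signature
    return failures
```

Passing an open log file to `find_fetch_failures` yields one entry per failed fetch, which makes it easier to compare failure counts between runner versions.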

Environment description

Custom installation for my org (details available via DM). We're a large Ultimate customer.

docker+machine (latest) and docker-autoscaler, runner versions 17.10 through 18.1.1

docker 25.0.8-1.amzn2023.0.4

The Kubernetes executor on 17.10 does not exhibit this issue.

config.toml contents

[[runners]]
  name = "autoscale-1c"
  limit = 300
  url = "https://gitlab.XX.com"
  id = 42825
  token = "XXXX"
  token_obtained_at =
  token_expires_at =
  executor = "docker-autoscaler"
  [runners.cache]
    Type = "s3"
    Shared = true
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
      ServerAddress = "XX"
      AccessKey = "XX"
      SecretKey = "XX"
      BucketName = "XX"
      BucketLocation = "XX"
  [runners.feature_flags]
    FF_USE_FLEETING_ACQUIRE_HEARTBEATS = true
  [runners.docker]
    tls_verify = false
    image = "alpine:latest"
    privileged = false
    disable_entrypoint_overwrite = false
    cap_add = ["SYS_ADMIN"]
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/builds:/some-other-dir"]
    pull_policy = ["if-not-present"]
    shm_size = 0
    network_mtu = 0
    [runners.docker.ulimit]
      nofile = "2500"
  [runners.autoscaler]
    capacity_per_instance = 1
    max_use_count = 50
    max_instances = 300
    plugin = "aws:latest"
    update_interval = "10s"
    update_interval_when_expecting = "0s"
    [runners.autoscaler.plugin_config]
      config_file = "/home/XX/.aws/config"
      credentials_file = "/home/XX/.aws/credentials"
      name = "XX"
      profile = "default"
    [runners.autoscaler.connector_config]
      protocol_port = 22
      username = "ec2-user"
      keepalive = "0s"
      timeout = "0s"
      use_external_addr = true

    [[runners.autoscaler.policy]]
      idle_count = 0
      idle_time = "30s"
      scale_factor = 0.0
      scale_factor_limit = 0
    [runners.autoscaler.state_storage]
      enabled = true

Used GitLab Runner version

Version:      18.1.1
Git revision: 2b813ade
Git branch:   18-1-stable
GO version:   go1.24.4 X:cacheprog
Built:        2025-06-26T16:25:31Z
OS/Arch:      linux/amd64

Possible fixes

Downgrading to 17.9.0 has worked.

Edited Jul 10, 2025 by Dustin Gardner