
docker-autoscaler auth issues

I'm using the experimental docker-autoscaler executor. Scaling, running jobs and so on work fine, except for images that require Docker authentication.

We use ECR for our images, so we installed the ECR credential helper and adjusted Docker's config.json to use it. This works perfectly fine when I log in to the EC2 instance and run docker pull for an image on ECR.

However, when a new job is launched with that executor, the pull fails:

ERROR: Job failed: failed to pull image "0123456789012.dkr.ecr.eu-central-1.amazonaws.com/cicd:latest" with specified policies [if-not-present]: Error response from daemon: Head "https://0123456789012.dkr.ecr.eu-central-1.amazonaws.com/v2/cicd/manifests/latest": no basic auth credentials (manager.go:237:0s)

If the image has already been pulled on the instance beforehand, this of course works fine, and it indicates that it is using root as the user.
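This also fits the `pull_policy = ["if-not-present"]` setting in the runner's config.toml: with that policy the registry is only contacted when the image is missing locally, so a pre-pulled image never triggers an authenticated pull. A simplified Go model of the policy decision (`needsPull` is a hypothetical helper, not GitLab Runner code):

```go
package main

import "fmt"

// needsPull reports whether the registry has to be contacted under a pull policy.
// "always" always pulls, "if-not-present" pulls only when the image is missing
// locally, and "never" never pulls (a missing image then fails the job later).
func needsPull(policy string, presentLocally bool) (bool, error) {
	switch policy {
	case "always":
		return true, nil
	case "if-not-present":
		return !presentLocally, nil
	case "never":
		return false, nil
	default:
		return false, fmt.Errorf("unknown pull policy %q", policy)
	}
}

func main() {
	for _, present := range []bool{true, false} {
		pull, _ := needsPull("if-not-present", present)
		fmt.Printf("image present locally: %v -> contact registry: %v\n", present, pull)
	}
}
```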

There are a few points I tried to debug that are not entirely clear to me:

Usually, the credential helper configuration in Docker is client-side configuration. With the docker-autoscaler, however, it's not clear to me how Docker is being invoked.

I have a few guesses/theories about how this currently works:

Option 1:

Does the plugin connect as the configured user and invoke the docker binary? In that case, /home/myuser/.docker/config.json should be read and used.

Option 2:

Same as above, but the plugin uses a different user, such as root, to ensure it has privileges on the socket, so that user's config is read?
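For Options 1 and 2 the user matters because the Docker CLI looks for `$DOCKER_CONFIG/config.json` when `DOCKER_CONFIG` is set and for `$HOME/.docker/config.json` otherwise. A small Go sketch of that lookup (the function name `dockerConfigFile` is hypothetical), showing why `ubuntu` and `root` would read different files:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dockerConfigFile mimics how the docker CLI locates its config file:
// $DOCKER_CONFIG/config.json if DOCKER_CONFIG is set, else $HOME/.docker/config.json.
func dockerConfigFile(dockerConfigEnv, home string) string {
	dir := dockerConfigEnv
	if dir == "" {
		dir = filepath.Join(home, ".docker")
	}
	return filepath.Join(dir, "config.json")
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("no home directory:", err)
		return
	}
	fmt.Println("current user:", dockerConfigFile(os.Getenv("DOCKER_CONFIG"), home))
	fmt.Println("ubuntu:", dockerConfigFile("", "/home/ubuntu"))
	fmt.Println("root:", dockerConfigFile("", "/root"))
}
```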

Option 3:

The docker binary on the EC2 instance is never called; instead, the autoscaler runner uses a Docker client/library on its own host and therefore needs to know about the login there?

Any hints on how that works to aid in debugging would be greatly appreciated.

Additional information

config.toml of the runner container:

concurrent = 10
listen_address = ":8083"
check_interval = 0
log_format = "json"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "workload-isolation-runner"
  output_limit = 20480
  url = "http://my.gitlab.example.com"
  id = 1412220
  token = "MY_TOKEN"
  executor = "docker-autoscaler"

  [runners.docker]
    tls_verify = false
    image = "123456789012.dkr.ecr.eu-central-1.amazonaws.com/cicd:latest"
    dns = ["169.254.169.253"]
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/var/run/docker.sock:/var/run/docker.sock"]
    shm_size = 0
    host = "unix:///var/run/docker.sock"
    pull_policy = ["if-not-present"]

  [runners.autoscaler]
    plugin = "fleeting-plugin-aws"
    capacity_per_instance = 1
    max_use_count = 0 # TODO change back to 1
    max_instances = 10

    [runners.autoscaler.plugin_config]
        name = "MYASG_NAME" # AWS Autoscaling Group name

    [runners.autoscaler.connector_config]
        username = "ubuntu"
        use_external_addr = true

    [[runners.autoscaler.policy]]
        idle_count = 1
        idle_time = "20m0s"

gitlab-runner --version
Version:      16.0.0
Git revision: 3cc4d81a
Git branch:   16-0-stable
GO version:   go1.19.9
Built:        2023-05-22T14:09:02+0000
OS/Arch:      linux/amd64

fleeting-plugin-aws --version
Name:         fleeting-plugin-aws
Version:      v0.3.0
Git revision: eb203580
Git ref:      refs/pipelines/871459875
GO version:   go1.19.6
Built:        2023-05-18T11:13:39+0000
OS/Arch:      linux/amd64

Version and configuration on the runner EC2 instances in the Auto Scaling group:

Docker version 24.0.1, build 6802122
cat /root/.docker/config.json
{
        "credsStore": "ecr-login"
}
cat /home/ubuntu/.docker/config.json
{
        "credsStore": "ecr-login"
}
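Both files select the same helper for every registry. For comparison with whatever `config.json` the pulling process actually reads, here is a simplified sketch of how the Docker CLI picks a helper: a per-registry `credHelpers` entry wins over the global `credsStore`. It ignores the `auths` section, and `helperFor` is a hypothetical name:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// configFile holds the credential-related keys of a Docker config.json.
type configFile struct {
	CredsStore  string            `json:"credsStore"`
	CredHelpers map[string]string `json:"credHelpers"`
}

// helperFor returns the docker-credential-* binary used for a registry host:
// a credHelpers entry for that host takes precedence over credsStore.
func helperFor(raw []byte, registry string) (string, error) {
	var cfg configFile
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return "", err
	}
	if h, ok := cfg.CredHelpers[registry]; ok && h != "" {
		return "docker-credential-" + h, nil
	}
	if cfg.CredsStore != "" {
		return "docker-credential-" + cfg.CredsStore, nil
	}
	return "", nil // no helper configured for this registry
}

func main() {
	// Same content as /root/.docker/config.json and /home/ubuntu/.docker/config.json above.
	raw := []byte(`{"credsStore": "ecr-login"}`)
	h, err := helperFor(raw, "123456789012.dkr.ecr.eu-central-1.amazonaws.com")
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println(h) // docker-credential-ecr-login
}
```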
Edited by Andreas Sieferlinger