Concurrent doesn't limit the number of docker-machine VMs created

Summary

In #2611 (comment 801403416) I relayed a concern raised by https://gitlab.my.salesforce.com/00161000004zoBW: that `concurrent` wasn't behaving as documented.

It turns out that setting the global `concurrent` value does limit the overall job-execution parallelism for a given runner-manager, but it does not limit the number of VMs created by docker-machine.
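A simplified way to reason about this (a sketch based on the observed behaviour above, not on runner source code): `concurrent` caps jobs in flight across the whole runner-manager, while each `[[runners]]` entry's autoscaler independently maintains its own `IdleCount` pool of warm VMs, so the VM total can exceed `concurrent`:

```python
# Simplified model of the observed behaviour (an assumption, not a
# reading of the runner's actual scheduling code).

def max_expected_vms(concurrent: int, idle_counts: list[int]) -> int:
    """Upper bound on VMs: one per running job (capped by `concurrent`)
    plus each [[runners]] entry's independently maintained idle pool."""
    return concurrent + sum(idle_counts)

# The reproduction config below: concurrent = 5, three [[runners]]
# sections, each with IdleCount = 1.
print(max_expected_vms(5, [1, 1, 1]))  # → 8, i.e. more than concurrent = 5
```

Under this model, the three runners in the config below can legitimately keep 8 VMs alive even though only 5 jobs ever run at once.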

Steps to reproduce

  1. Create sustained job demand at or above the value of concurrent
  2. Register multiple [[runners]] (example config.toml provided below).
  3. Observe that the number of docker+machine VMs created exceeds the stipulated concurrent limit

Described in the attached videos.

  • Video 1: Context & Setup Description

screencap-1

  • Video 2: Watch a pipeline that creates sustained parallel job demand spawn more VMs than concurrent implies

screencap-2

config.toml
concurrent = 5
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "k2-dm"
  url = "https://gitlab.jreid.dev"
  token = "redacted"
  executor = "docker+machine"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "alpine:latest"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.machine]
    IdleCount = 1
    IdleScaleFactor = 0.0
    IdleCountMin = 0
    MachineDriver = "virtualbox"
    MachineName = "dm-as-1-%s"

[[runners]]
  name = "k2-dm-2"
  url = "https://gitlab.jreid.dev"
  token = "redacted"
  executor = "docker+machine"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "alpine:latest"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.machine]
    IdleCount = 1
    IdleScaleFactor = 0.0
    IdleCountMin = 0
    MachineDriver = "virtualbox"
    MachineName = "dm-as-2-%s"

[[runners]]
  name = "k2-dm-3"
  url = "https://gitlab.jreid.dev"
  token = "redacted"
  executor = "docker+machine"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "alpine:latest"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.machine]
    IdleCount = 1
    IdleScaleFactor = 0.0
    IdleCountMin = 0
    MachineDriver = "virtualbox"
    MachineName = "dm-as-3-%s"

.gitlab-ci.yml
stages:
    - one
    # - two
stage-1-job-1:
    stage: one
    script:
        - echo "hello world"
        - sleep 180
        - echo "goodbye world"
stage-1-job-2:
    stage: one
    script:
        - echo "hello again"
        - sleep 180
        - echo "seeya"
stage-1-job-3:
    stage: one
    script:
        - echo "hello world"
        - sleep 120
        - echo "goodbye world"
stage-1-job-4:
    stage: one
    script:
        - echo "hello again"
        - sleep 120
        - echo "seeya"
stage-1-job-5:
    stage: one
    script:
        - echo "hello again"
        - sleep 120
        - echo "seeya"
stage-1-job-6:
    stage: one
    script:
        - echo "hello again"
        - sleep 120
        - echo "seeya"
stage-1-job-7:
    stage: one
    script:
        - echo "hello again"
        - sleep 60
        - echo "seeya"
stage-1-job-8:
    stage: one
    script:
        - echo "hello again"
        - sleep 60
        - echo "seeya"
stage-1-job-9:
    stage: one
    script:
        - echo "hello again"
        - sleep 60
        - echo "seeya"
stage-1-job-10:
    stage: one
    script:
        - echo "hello again"
        - sleep 60
        - echo "seeya"
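The pipeline above is intentionally repetitive. For reference, an equivalent set of job definitions could be generated with a small helper like this (a hypothetical convenience, not part of the report):

```python
# Generate a single-stage pipeline of N sleep jobs whose combined
# duration creates sustained parallel demand above `concurrent = 5`.

def make_pipeline(sleeps: list[int]) -> str:
    lines = ["stages:", "    - one"]
    for i, secs in enumerate(sleeps, start=1):
        lines += [
            f"stage-1-job-{i}:",
            "    stage: one",
            "    script:",
            '        - echo "hello"',
            f"        - sleep {secs}",
            '        - echo "goodbye"',
        ]
    return "\n".join(lines)

# Same sleep profile as the pipeline above: long jobs first so demand
# stays at or above `concurrent` while new jobs queue up.
print(make_pipeline([180, 180, 120, 120, 120, 120, 60, 60, 60, 60]))
```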

Actual behaviour

The docker+machine autoscaler creates additional VMs that aren't immediately handed jobs (because the allowable number of concurrent jobs has already been reached).

Expected behaviour

Up for debate. On one hand, it's arguably convenient to have a "warmed-up" VM ready to accept an additional job; this can reduce job wait times, particularly if the machine type differs from that of an about-to-finish job.

Example:

concurrent = 4

  • Active jobs: t2.medium = 3, t2.large = 1

  • Pending jobs: t2.medium = 0, t2.large = 2

Having up to three t2.large docker+machine VMs spun up (two of which simply wait to accept the pending jobs for an indeterminate amount of time) will definitely save VM startup time for the pending jobs as the active jobs complete.
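The arithmetic behind "up to three t2.large VMs" can be spelled out as:

```python
# Worked arithmetic for the example above: with concurrent = 4, the
# autoscaler may still warm VMs for pending jobs, so the t2.large VM
# count can reach active + pending even though only one t2.large job
# is actually running.

active = {"t2.medium": 3, "t2.large": 1}
pending = {"t2.medium": 0, "t2.large": 2}

# One t2.large running a job, plus warm VMs for the two pending jobs.
t2_large_vms = active["t2.large"] + pending["t2.large"]
print(t2_large_vms)  # → 3
```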

On the other hand, this is unexpected if you're relying on concurrent to limit the total number of active VMs; for example, if you're managing a limited amount of IP address space for a given region or subnet.
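To illustrate the address-space concern (the subnet size here is a hypothetical example, not from the report): an operator who sizes a subnet for `concurrent` jobs can still run out of addresses when idle VMs push the total higher.

```python
# A /29 holds 8 addresses, 6 of them usable for hosts. That fits
# concurrent = 5 jobs, but not the up-to-8 VMs that three [[runners]]
# entries with IdleCount = 1 can keep alive (see the config above).
import ipaddress

subnet = ipaddress.ip_network("10.0.0.0/29")  # 8 addresses in total
usable = subnet.num_addresses - 2             # minus network & broadcast

concurrent = 5  # 5 jobs <= 6 usable addresses: looks safe on paper
idle_vms = 3    # one IdleCount = 1 pool per [[runners]] entry

print(f"usable={usable}, worst-case VMs={concurrent + idle_vms}")
# worst case (8 VMs) exceeds the 6 usable addresses
```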

Used GitLab Runner version

arch=amd64 os=linux pid=141019 revision=5316d4ac version=14.6.0
Edited by Pedro Pombeiro