Prevent new autoscaler thrashing instances

Arran Walker requested to merge 29431-thrashing into main

What does this MR do?

The change first uses taskscaler.Reserve() to reserve capacity, and once we have a job, calls taskscaler.Acquire() in the wrapped executor returned by the executor provider. This ensures that when idle scaling is enabled, we only accept jobs once we've confirmed there's capacity and have reserved it.
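
The reserve-then-acquire flow described above can be sketched with a toy model. This is a minimal illustration under stated assumptions, not the real taskscaler API: the `pool` type, its fields, the `Unreserve` method, and the `requestJob` helper are all hypothetical, and the actual signatures of `taskscaler.Reserve()` and `taskscaler.Acquire()` may differ. The point it shows is that a poll that finds no job hands its reservation back without the slot ever counting as used.

```go
package main

import (
	"errors"
	"fmt"
)

// pool is a toy stand-in for taskscaler. It is NOT the real taskscaler API;
// every name here is hypothetical and exists only for illustration.
type pool struct {
	free     int // slots that are neither reserved nor running a job
	reserved int // slots held for a job that has not arrived yet
	inUse    int // slots currently running a job
	uses     int // number of times a slot was actually acquired
}

var errNoCapacity = errors.New("no capacity available")

// Reserve holds a free slot. It does not count as a use.
func (p *pool) Reserve() error {
	if p.free == 0 {
		return errNoCapacity
	}
	p.free--
	p.reserved++
	return nil
}

// Unreserve (hypothetical) returns a reserved slot untouched.
func (p *pool) Unreserve() {
	if p.reserved > 0 {
		p.reserved--
		p.free++
	}
}

// Acquire turns an existing reservation into a used slot.
func (p *pool) Acquire() error {
	if p.reserved == 0 {
		return errors.New("acquire called without a reservation")
	}
	p.reserved--
	p.inUse++
	p.uses++
	return nil
}

// requestJob models one poll for work: reserve first, and acquire only
// once a job is actually available. It reports whether a job was started.
func requestJob(p *pool, jobAvailable bool) (bool, error) {
	if err := p.Reserve(); err != nil {
		return false, err // no capacity, so do not accept a job
	}
	if !jobAvailable {
		p.Unreserve() // nothing to run: the slot was never used
		return false, nil
	}
	if err := p.Acquire(); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	p := &pool{free: 2}
	for i := 0; i < 3; i++ {
		_, _ = requestJob(p, false) // empty polls leave the use count alone
	}
	started, err := requestJob(p, true)
	fmt.Println("started:", started, "err:", err, "uses:", p.uses)
}
```

In this sketch only the final poll, which finds a job, increments `uses`. Under an acquire-then-release scheme, each empty poll would have incremented it as well.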

Why was this MR needed?

Previously, we acquired capacity and then released it if there was no job. This thrashed VMs whenever taskscaler was configured to remove capacity after it had been used.

What's the best way to test this MR?

Test with various scaling rules, for example:

  [runners.autoscaler]
    capacity_per_instance = 2
    max_use_count = 2
    max_instances = 5
    plugin = "fleeting-plugin-aws"

    [[runners.autoscaler.policy]]
      idle_count = 4
      idle_time  = "20m"

    [runners.autoscaler.connector_config]
      username = "ubuntu"
      timeout = "10m"

    [runners.autoscaler.plugin_config]
      name = "<autoscaler name>"
      region = "us-west-2"

What are the relevant issue numbers?

Closes #29431
