idle_percentage isn't used in aws/runner/internal/ec2 module

What's the Problem?

Changes to var.idle_percentage seem to have no effect in module/aws/runner/internal/ec2.

Why do I think that?

I've been experimenting with different values for var.idle_percentage in GRIT while I've been performance testing for the Hosted Runners for GitLab Dedicated project (Transistor).

Through these experiments, I've found that changing var.idle_percentage doesn't actually affect the behaviour of the runner manager or how it provisions instances from the ASG. This prompted me to trace the variable idlePercentage and how it's eventually passed through to GRIT, and I found that it isn't passed any further down after a certain module.

Here is a link to the Investigation Thread on the original issue: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/7185#note_2270069607. The findings are summarised in the next section.

Summary

Assuming grit@v0.10.0, as this is the version that is vendored into Transistor:

  1. $RUNNER_MODEL: stack.idlePercentage begins its life in the runner model
  2. validate-runner.libsonnet and provision-terraform.jsonnet: It's parsed through Jsonnet into
  3. aws/provision/config.auto.tfvars.json: This file is ingested into Terraform in the provision stage.
  4. aws/provision/: It's ingested in variables.tf#88 and passed down in runner-manager.tf#43 to
  5. grit/modules/aws/runner/prod/: It's ingested in variables.tf#51 and validated in runner.tf#29, but it is not passed further down the module stack to ../internal/ec2.

I believe that idle_percentage should end up in the runner manager's config.toml as the scale_factor setting, based on my reading of https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runnersautoscalerpolicy-sections. The Terraform definition of idle_percentage closely matches the definition of scale_factor.

  • idle_percentage:
    • "The number of idle instances to maintain as a percentage of the current number of busy instances"
    • Source
  • scale_factor:
    • "The target idle capacity we want to be immediately available for jobs, on top of the idle_count, as a factor of the current in use capacity. Defaults to 0.0."
    • Source
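
To make the proposed mapping concrete, here is a minimal sketch of how idle_percentage might translate into scale_factor. This is an assumption on my part, not how GRIT currently behaves: the function names are hypothetical, and the idle_count and idle_time values are placeholders. Only the policy keys (idle_count, idle_time, scale_factor) come from the runner documentation linked above.

```python
def scale_factor_from_idle_percentage(idle_percentage: float) -> float:
    """Hypothetical mapping: treat idle_percentage as a fraction of in-use capacity."""
    if idle_percentage < 0:
        raise ValueError("idle_percentage must be non-negative")
    return idle_percentage / 100.0


def render_policy(idle_count: int, idle_time: str, idle_percentage: float) -> str:
    """Render a [[runners.autoscaler.policy]] fragment as TOML text (illustrative only)."""
    scale_factor = scale_factor_from_idle_percentage(idle_percentage)
    return (
        "[[runners.autoscaler.policy]]\n"
        f"  idle_count = {idle_count}\n"
        f'  idle_time = "{idle_time}"\n'
        f"  scale_factor = {scale_factor}\n"
    )


# Placeholder values; only idle_percentage=50 is taken from this issue.
print(render_policy(idle_count=0, idle_time="20m0s", idle_percentage=50))
```

With idle_percentage set to 50, the rendered fragment would carry scale_factor = 0.5, which is what I would expect to find in config.toml if the variable were wired through.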

Previous Theory

This means that the value of idlePercentage never makes its way to the fleeting module, nor is it used in the config.toml runner configuration. This seems to be a bug in GRIT, and we can see potential evidence of that if we take a look at the grit/modules/aws/runner/internal/ec2/ module.

In grit/modules/aws/runner/internal/ec2/ec2.tf we can see that var.capacity_per_instance is being used in some calculations for:

- idle_count (#32)
- concurrent (#26)

I believe that this is incorrect, and idlePercentage should be used in these calculations. Reading the advanced configuration documentation, we can see that capacity_per_instance is defined as:

> The number of jobs that can be executed concurrently by a single instance.

This doesn't seem closely related to idle_count, and probably shouldn't be used in its calculation. Conceptually, it makes more sense to me for these calculations to use var.idle_percentage in some way.

Why does this matter?

For Transistor and the Hosted Runners for GitLab Dedicated project, idlePercentage is one of the controllable "limits" we have to tune the performance of the underlying ASG. If idlePercentage actually has no effect, we will have to develop another lever to affect ASG scaling to balance cost to GitLab with performance for the customer.

Outcome of this ticket

The way I understand idlePercentage is that it defines the extra, idle instances to be provisioned as a percentage of active instances, to help the ASG scale to accommodate a burst in pipelines. So if idlePercentage == 50%, when 11 jobs are added to the queue and start to run, I expect 5 extra instances to be provisioned (~16 total running instances).
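
The arithmetic behind that expectation can be sketched as follows. This only encodes my assumption of the behaviour, not what the runner manager or GRIT actually does: the function name is hypothetical, one job per instance is assumed, and rounding down is my guess (the real autoscaler may round differently).

```python
import math


def expected_idle_instances(busy_instances: int, idle_percentage: float) -> int:
    """Extra idle instances as a percentage of busy instances, rounded down (assumed)."""
    return math.floor(busy_instances * idle_percentage / 100)


busy = 11  # jobs running, assuming one job per instance
idle = expected_idle_instances(busy, idle_percentage=50)
print(idle, busy + idle)  # 5 16
```

Here 50% of 11 is 5.5, which rounds down to the 5 extra instances and roughly 16 running instances mentioned above.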

  • Is my assumption of how idle_percentage works correct?
  • Is this a "bug" in GRIT, or intended behaviour?
    • If it's intended behaviour, it seems I don't understand it, and I may need some help figuring it out.
Edited by Nick Skoretz