
Optimize GitLab Runners costs

This is a follow-up to the discussions in https://gitlab.com/gitlab-com/infrastructure/issues/569 and https://gitlab.com/gitlab-com/infrastructure/issues/600.

Now

  1. We currently use 4G machines everywhere to have parity with what our competitors offer (e.g. Travis).
  2. We always clean up machines after running builds to make sure that there is no data leakage between builds.

Proposals

We should reduce the cost of running the runners infrastructure by figuring out the best possible approach:

  1. Switch testing of GitLab CE/EE to use 2G machines, if possible.
  2. Switch shared runners to always use 2G machines by default (optionally allowing users to choose 4G ones).
  3. Try to improve the reusability of runners by introducing better control over docker-machine.
  4. Since 40% of our builds are run for GitLab CE/EE, we could build separate shared runners just for them, to be used by everyone (including contributors). These runners would be configured to re-use machines, could be configured to use only 2G machines, and would require runner tags in order to be used.

Effort required

  • Option 1 requires switching the provisioned machines to 2G; it should not require anything other than tweaking configuration.
  • Option 2 requires a proper announcement to users, and it may be difficult to deliver this message properly.
  • Option 3 requires development time and test rounds to ensure that we don't break other builds.
  • Option 4 requires building an additional manager with a different configuration and registering it as a shared runners manager.

Summary

I believe that option 4 is the simplest to execute now and gives the most cost savings. We already have all the bits required to configure the new managers. It would also allow us to easily test the new 2G runners and then switch everyone to use them. I strongly believe that even if we stick with 4G we would still be able to reduce the cost of running gitlab-ce builds by improving the efficiency of testing forks. This would also allow us to reduce the "global capacity" given to shared runners to prefer the light users of CI.

It will require about one week to prepare the switch, and it should give us about 20-30% cost savings.
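As a rough sanity check of an estimate like this, here is a minimal back-of-the-envelope sketch. It rests on assumptions that are not measured figures from this issue: that cost is proportional to the share of builds, and that a 2G machine costs half as much as a 4G one. The function name and the price ratio are hypothetical; savings from re-using machines are not modeled.

```python
def estimated_savings(ce_ee_share: float, small_to_large_price_ratio: float) -> float:
    """Fraction of total runner spend saved if the CE/EE share of builds
    moves to smaller machines.

    Assumes (hypothetically) that spend is proportional to build share and
    that nothing else about the builds changes.
    """
    if not 0.0 <= ce_ee_share <= 1.0:
        raise ValueError("ce_ee_share must be between 0 and 1")
    if not 0.0 <= small_to_large_price_ratio <= 1.0:
        raise ValueError("small_to_large_price_ratio must be between 0 and 1")
    # Only the CE/EE share gets cheaper, by (1 - price ratio).
    return ce_ee_share * (1.0 - small_to_large_price_ratio)


if __name__ == "__main__":
    # 40% of builds are for GitLab CE/EE (from this issue);
    # the 0.5 price ratio is an assumption, not a quoted price.
    savings = estimated_savings(ce_ee_share=0.40, small_to_large_price_ratio=0.5)
    print(f"Estimated savings: {savings:.0%}")
```

Plugging in real per-machine prices and the actual share of build minutes would give a more trustworthy number than the illustrative inputs used here.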

This would be a step before we start working on a switch to a much better scaling solution, probably Kubernetes.