Requirements gathering for CI job autoscaling

Overview

The main goal of this issue is to gather as much information as possible about how our customers currently use GitLab Runner to autoscale infrastructure, and what the pain points are. This issue will not come up with a solution for every problem; instead it should become the SSOT for all the requirements we need to meet. There may be multiple scenarios/stories that need to be addressed by multiple solutions, so first we need to identify the scenarios/stories our users have when autoscaling with our product.

Final Design requirements

With all the information gathered below and from the interviews, these are the high-level requirements that any solution(s) must meet to support our customers' needs.

Usability

  • Easy to set up with a few commands, then able to grow with the user's scale.
  • Ideally self-hosted runners and GitLab.com shared runners use something similar, with some extra security layers on GitLab.com, so that we dogfood the solution.
  • Container-based system to provide maximum portability and allow customization.
  • Cost-effective to run gitlab-runner.
  • Cost-effective to run user jobs.
  • Provide an easy upgrade path for the runner fleet.
  • HA out of the box.
  • Provide a frequent cleanup process/hooks to prevent filling up the disk.
  • Don't pick up jobs when there aren't enough resources to run them.

Security and reliability

  • Able to stop bitcoin miners in an automated fashion
  • CPU/memory limits (see the config sketch after this list)
  • Network/bandwidth limits
  • Each job is isolated both on a process and network-level (depending on multi-tenancy requirements)
  • Multi-tenant, able to run untrusted code from multiple users.
  • Based on a locked-down host OS to reduce security footprint.
  • Not susceptible to noisy neighbors.
  • Doesn't have a large security footprint when compared to the current solutions.
  • Observability
    • Able to tell what binaries/syscalls the user is executing
    • Able to tell what kind of network activity the user's job is generating
  • Fail closed: on any unexpected behavior or failure, terminate the job.
  • Multiple layers of security to provide defence in depth.
  • Run privileged and unprivileged containers in separate setups since they have different requirements; this also reduces the blast radius if we have to mitigate issues with privileged containers.
  • Ability to roll out kernel fixes and security patches in a fast and reliable manner.
  • Ephemeral gitlab-runner managers to prevent long-running processes.
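As a reference point for the CPU/memory limit requirement above, here is a minimal sketch of how per-job limits can already be expressed with the docker executor in config.toml. The runner name and values are illustrative, and there is no equivalent built-in option for the network/bandwidth limit.

```toml
# Sketch: per-job resource limits with the docker executor (illustrative values).
[[runners]]
  name = "resource-limited-docker-runner"   # placeholder name
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "docker"
  [runners.docker]
    image = "alpine:latest"
    cpus = "2"                  # cap the job container at 2 CPUs
    memory = "4g"               # hard memory limit for the job container
    memory_swap = "4g"          # no extra swap beyond the memory limit
    memory_reservation = "2g"   # soft limit applied under memory pressure
```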

Current state

Linux

Docker machine

  • Depends on docker-machine, which is in maintenance mode.
  • We keep our own fork to keep it alive for our needs.
  • Used for GitLab.com instance runners to create a VM per job.
  • docker-machine is really useful because it's a simple way to create a VM on a cloud provider with Docker preinstalled; the runner then communicates with Docker over the API.
  • It allows our users to easily autoscale new machines on the cloud provider depending on the number of jobs the runner has picked up.
  • It allows users to scale the number of idle machines up/down based on time of day thanks to [[runners.machine.autoscaling]] (see the sketch after this list).
  • It allows the user to customize the type of VM that is created with all the docker-machine options.
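A minimal config.toml sketch of the setup described above, with illustrative values and a placeholder GCP project, showing how [[runners.machine.autoscaling]] adjusts the idle pool on a schedule:

```toml
concurrent = 50

[[runners]]
  name = "docker-machine-autoscaler"   # placeholder name
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "docker+machine"
  [runners.docker]
    image = "alpine:latest"
  [runners.machine]
    IdleCount = 10          # machines kept warm, waiting for jobs
    IdleTime = 1800         # seconds a machine may sit idle before removal
    MaxBuilds = 100         # recycle a machine after this many jobs
    MachineDriver = "google"
    MachineName = "ci-runner-%s"
    MachineOptions = [
      "google-project=example-project",        # placeholder project
      "google-machine-type=n1-standard-2",
    ]
    # Scale the idle pool up during working hours and down on weekends.
    [[runners.machine.autoscaling]]
      Periods = ["* * 9-17 * * mon-fri *"]
      IdleCount = 40
      IdleTime = 3600
      Timezone = "UTC"
    [[runners.machine.autoscaling]]
      Periods = ["* * * * * sat,sun *"]
      IdleCount = 2
      IdleTime = 300
      Timezone = "UTC"
```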
Where it lacks
  • We have a good number of open feature requests/bug fixes in our own fork: https://gitlab.com/gitlab-org/ci-cd/docker-machine/-/issues
    • more AWS support
    • more GCP support
    • more baseOS support
  • We built our own custom autoscaling algorithm which requires a lot of maintenance.
  • We have some bugs in our autoscaler that are costing users money: https://gitlab.com/gitlab-org/gitlab-runner/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=executor%3A%3Adocker-machine
  • Lack of resiliency when it comes to region outages #26447 (closed). There is no way to automatically stop a runner from picking up jobs and scheduling machines in an affected region.
  • All docker-machine state is handled locally and there is no reconciliation loop, so when the state on the cloud provider changes it doesn't change locally, which causes split-brain issues.
  • It requires the Runner team to be an expert on each cloud provider.
  • Looking at all the issues surrounding docker-machine https://github.com/docker/machine/issues, there are a ton of support/feature requests for other cloud providers that we potentially want to support as well.
  • Support for bare-metal servers through the OpenStack/VMware drivers only.
  • GitLab Runner requires privileged cloud provider credentials to create/delete machines, making it a riskier point of failure.
  • It provides autoscaling but not HA, which requires users to roll out their own HA setup.
  • Setting up new runner managers results in race conditions around certificate creation.
  • Users end up with failed jobs because the machine runs out of disk space.

Kubernetes

  • Autoscaler depends on the cloud provider.
  • Consistent interface between each cloud provider.
  • Uses containers for isolation instead of VMs, which provides quicker start times and better bin packing.
  • All major cloud providers provide Kubernetes as a service
  • Can deploy on bare metal servers/self-hosted infrastructure whilst still providing the same interface.
Where it lacks
  • Hard for customers to get started on their own because it requires a lot of Kubernetes knowledge.
  • We provide no guidance on how to set it up, which leads to a lot of problems for users.
  • Most users don't define CPU/memory requests/limits, which leads to autoscaling not working as expected (see the sketch after this list).
  • When gitlab-runner picks up a new job it immediately schedules it on the cluster. The cluster might not have the capacity, so the Pod can stay pending for a long time. The job stays in the running state (eating up runner minutes and job timeout minutes) while we wait for the Pod to be scheduled, and eventually times out because of poll_timeout or the job timeout.
  • Requires turning on privileged mode to build containers, opening up the whole node/cluster to escalation issues rather than just a single VM, unlike docker-machine.
  • When autoscaling is configured, a node might be removed because only 1 Pod is scheduled on it, which results in a job failure. We don't implement any mechanism to prevent this, causing flaky jobs that users have to retry manually, or retry on runner_system_failure if we report it as that error (needs to be checked).
  • GitLab cloud-native chart installs GitLab Runner on the same cluster as GitLab
  • Doesn't work on OpenShift out of the box
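A minimal [runners.kubernetes] sketch with illustrative values, showing the explicit requests/limits and poll_timeout mentioned in the list above; with these set, the cluster autoscaler gets accurate signals and a Pod that can't be scheduled fails faster instead of hanging until the job timeout:

```toml
concurrent = 30

[[runners]]
  name = "kubernetes-runner"   # placeholder name
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "gitlab-runner"
    image = "alpine:latest"
    privileged = false
    poll_timeout = 600            # seconds to wait for the Pod before failing the job
    # Explicit requests/limits so the cluster autoscaler can make informed decisions.
    cpu_request = "500m"
    cpu_limit = "1"
    memory_request = "1Gi"
    memory_limit = "2Gi"
    helper_cpu_request = "100m"
    helper_memory_request = "128Mi"
    service_cpu_request = "250m"
    service_memory_request = "256Mi"
```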

Using cloud provider autoscaler

  • Create a base image with gitlab-runner and required tools installed.
  • Create an Autoscaling group for example in AWS/GCP.
  • Use docker executor.
  • The cloud provider does the autoscaling, and it has far more resources and autoscaling knowledge than we have in the Runner group.
  • Works really well for self-hosted customers where isolation is not a concern.
  • Already deployed by customers.
  • Requires less overhead/maintenance on our side compared to docker-machine, especially with Terraform/CloudFormation tooling where everything is abstracted.
Where it lacks
  • No guidance from us; however, this is changing with https://gitlab.com/gitlab-org/ci-cd/distribution/runner/self-hosted for GCP, and we can extend it to other cloud providers.
  • Each cloud provider has its own autoscaler logic and set of features.
  • The runner doesn't expose all the metrics the autoscaler needs to avoid terminating a machine at the wrong time.
  • Some bin packing issues such as #4338 (comment 384827009), where the autoscaler is not workload-aware since we aren't just serving traffic.
  • Not all cloud providers support autoscaling.
  • When a machine is terminated it will most of the time kill a job that is running, causing the job to fail.
  • When a machine is terminated it will most of the time not unregister the runner, which leads to a bunch of dead runners staying registered.
  • A runner can pick up more jobs than it can handle, unlike docker-machine, which picks up jobs based on the number of machines and idle machines (see the sketch after this list).
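A sketch of the per-instance config.toml that could be baked into the base image in this setup (values are illustrative). Capping concurrent and limit mitigates, but doesn't fully solve, the "picks up more jobs than it can handle" problem from the list above:

```toml
# Baked into the base image; every instance in the autoscaling group runs this.
concurrent = 4        # never run more jobs at once than this instance can handle

[[runners]]
  name = "autoscaling-group-instance"   # placeholder name
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "docker"
  limit = 4             # cap the number of jobs taken by this registration
  [runners.docker]
    image = "alpine:latest"
    privileged = false
```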
Cloud provider autoscaling compared to docker-machine

(Feature comparison table: docker-machine vs. Azure, GCP, AWS, Digital Ocean, Alibaba Cloud, IBM, OpenStack, and Nomad, across the features defined in the legend below.)
Legend
  • scheduled autoscale: Scale in/out at a specific time.
  • idle machine count: Machines sitting idle waiting to pick up jobs.
  • custom metric-based scaling: Run a specific query, for example against Prometheus, to scale in/out.
  • vertical autoscale: Increase the CPU/memory/IO of a single machine.
  • alerts when autoscale triggers: Can see when the cluster scales in/out through some kind of alert or audit log.
  • multiple availability zones: Allow scaling in/out across multiple AZs without setting up 2 separate autoscaling configurations.
  • health checks: Have a way to auto-heal an instance by pinging a specific endpoint; if it's down, restart the machine.
  • multiple instance types: Allow the autoscaling group to scale in/out with multiple instance types at once.
  • graceful termination: When you scale in a group, you have control over how machines are terminated so it doesn't remove a machine that is running a job.
  • scaling cooldowns: When a scale in/out happens, wait X amount of time until the next change to prevent thrashing instances.
  • parallel autoscaling: Can scale in/out multiple instances at once.
  • max lifetime of machine: After X amount of time, terminate the instance.

Windows

  • Using autoscaler for Windows on GCP only.
  • In beta at the moment.

macOS

  • No autoscaling provided for macOS runners.
  • Building a platform on top of Orka to support autoscaling in &1830

Kubernetes vs docker-machine feature comparison

| Feature | Kubernetes | docker-machine | Needed for GitLab.com |
|---|---|---|---|
| Kaniko builds | | | 👍 |
| git fetch (#3847 (closed)) | | | 👎 |
| artifacts | | | 👍 |
| cache | | | 👍 |
| services | | | 👍 |
| services aliases | | | 👍 |
| dind builds | | | 👍 |
| Windows | | | 👍 |
| VM isolation | Depends if we allow privileged containers | | |
| Docker ENTRYPOINT execution (#4125 (closed)) | | | 👍 |
| pwsh support (#13145 (closed), #4021 (closed)) | | | 👍 |

Using Docker to build images on GitLab.com

Right now we allow running privileged containers on GitLab.com infrastructure so that users can build their own Docker images. This opens up a large attack surface because we lose most of the security benefits of containers, which is why we create a VM per job so that there is complete isolation (the runner-side setting is sketched below).
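For reference, a sketch of the runner-side setting that makes these builds possible (image tag and runner name are illustrative). privileged = true is exactly the attack surface described above, which is why each such job gets its own VM:

```toml
[[runners]]
  name = "privileged-docker-builds"   # placeholder name
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "docker+machine"         # one VM per job isolates the privileged container
  [runners.docker]
    image = "docker:20.10"
    privileged = true                 # required for docker-in-docker image builds
    volumes = ["/certs/client"]
```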

What competitors are doing:

  • Travis CI - You specify docker as a service and it's exposed to you; if no service is defined, Docker is not started.
  • CircleCI
    • Using the machine executor (first example) creates a VM specifically for you, so you have access to Docker.
    • Using the docker executor (second example) sets up a Docker server somewhere else, so there is no need for privileged containers.
  • Semaphore - Access to a raw VM is given.
  • Buddy - You just specify the action and it builds it for you; what happens behind the scenes is fairly hidden.
  • Jenkins - Multiple ways to do this, but the common way is similar to ours: privileged containers or sharing the Docker socket.
  • GitHub Actions - Access to a raw VM is given.

Different infrastructure to autoscale

Cloud

  • GCP (Google Cloud Platform)
  • AWS (Amazon Web Services)
  • Azure
  • DO (DigitalOcean)
  • Alibaba Cloud
  • IBM

Bare metal

  • VMware
  • Hyper-V
  • OpenStack
  • Nomad

Runner deployment methods

Self Hosted - Using their own Data Center

Requirements

  • Creating VMs using a hypervisor
  • Usually air-gapped environments
  • Sometimes managed by OpenStack
  • Running trusted code
  • Autoscaling using OpenStack/Hypervisor features.

Our solutions

  • Kubernetes executor
  • Use Nomad/OpenStack for autoscaling; no guidance provided by us.

Self Hosted - Using Kubernetes

Requirements

  • Experience with Kubernetes varies from beginner to expert
  • gitlab-runner and jobs run inside Kubernetes
  • Autoscaling both from a cluster and gitlab-runner perspective
  • Abstracting kubernetes primitives away from the user

Our solution

  • Install GitLab Runner on Kubernetes using our Helm chart
  • Install GitLab Runner on Kubernetes using the Operator

Self Hosted - Using cloud providers to host VMs

Requirements

  • Easy to manage the GitLab Runner cluster.
  • Guidance from us on how to use the cloud provider to set up GitLab Runner for HA.
  • Cost-effective, no wasted resources (autoscaling).
  • Possibly docker-in-docker support, but isolation isn't a requirement for most users.

Our solution

  • docker-machine executor
  • Kubernetes executor
  • Use cloud provider autoscaler

GitLab.com shared runners - unprivileged containers

  • Run job on GitLab.com using the default shared runners
  • Specify your own image
  • Container-based solution to use images
  • No maintenance for the user
  • They can do anything that they can usually do on their local machine

GitLab.com shared runners - privileged containers

Missing features from the current state

  • Abusive actions can be terminated immediately.
  • Easily update cluster/agent/rules for security issues.
  • Prevent any bitcoin miners from using CI with BPF traces
  • Network/Bandwidth Limit
  • Run jobs on underutilized runners
  • When using cloud APIs we end up getting rate limited because we send too many requests.
  • Binary tracing: block certain binaries by name, which will block the script kiddies.
  • Ability to block network access to certain endpoints.
  • Lack of visibility into what is going on on each of our machines; how do we know if they are being used to run a legitimate job?
  • Alerting if the user escaped from the container and got privileged access to the machine.
  • env prints all information, such as the machine's IP.
  • If we don't hear from a machine for a while we don't alert/kill it.
  • Allow users to spawn any container they want, with no limitations.
  • For the full security report from the red team see https://gitlab.com/gitlab-com/gl-security/security-operations/gl-redteam/red-team-operations/rt004-ci-abuse
  • Information about what the job is doing inside of our infrastructure and what kind of network it's using.
  • Tests to validate a performant and secure setup for users.

Specific topics that we didn't discuss

  • Cost optimization: this really depends on the solution we provide; it's something we take into consideration when implementing it.
  • Scheduling improvements: controlling which runner picks up which job.

Interviews with the community