Migrate away from Docker Machine for autoscaling
Introduction
GitLab Runner provides autoscaling, which makes it possible to use resources in a more elastic and dynamic way. Under the hood, it uses Docker Machine to provision machines on multiple cloud providers through its machine drivers. The problem is that Docker Machine is in maintenance mode, which makes it hard to keep relying on it. We already maintain a fork with fixes specific to GitLab.com, which is not ideal because it carries a large maintenance cost for the ~Verify team. We need to discuss ways to support autoscaling without Docker Machine.
Alternatives to Docker Machine
InfraKit
InfraKit is the successor of Docker Machine and is also maintained by Docker Inc. Using InfraKit would require us to keep the existing scheduler that we use for Docker Machine. The scheduler has worked fine for a long time, but it brings maintenance, technical debt, and all the other costs of owning more software, compared to something we get for "free" with Kubernetes. It's not clear whether InfraKit can provision Windows-based machines; that is something we need to verify.
Kubernetes
We generally push customers towards the Kubernetes executor, since it provides autoscaling out of the box and has one of the best workload schedulers. When customers hit issues with Docker Machine, we always suggest Kubernetes instead, since they get a lot more benefits from it. Currently, we can't run GitLab.com shared Runners on Kubernetes because we run untrusted user code, which could be used to escape from containers and harm the infrastructure. We need the kind of isolation that a full-blown virtual machine provides, or something similar, like what we have right now. There are a few areas we can or need to explore:
- There is already a discussion about using k8s with gVisor (https://github.com/google/gvisor) to sandbox everything: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6846
- In previous face-to-face chats, we talked about leveraging the k8s scheduler to provision VMs instead of containers. There is a project doing exactly that: https://kubevirt.io/
- Virtual Kubelet, which will provision k8s nodes automatically; check the introductory video
- Kata Containers with Firecracker VMs
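To make the sandboxing options above more concrete, here is a minimal sketch of how a job pod could opt into a sandboxed runtime through a Kubernetes RuntimeClass. It only builds the manifests as plain dictionaries. The handler name `runsc` is the conventional gVisor handler but depends on how the nodes' container runtime is configured, and the names `gvisor`, `job_pod`, and `job-<id>` are assumptions made for illustration, not anything GitLab Runner does today.

```python
import json


def runtime_class(name: str, handler: str) -> dict:
    """Build a RuntimeClass manifest mapping a name to a node-level runtime handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        # Assumption: the handler must match what the nodes' CRI runtime is configured with.
        "handler": handler,
    }


def job_pod(job_id: str, image: str, runtime_class_name: str) -> dict:
    """Build a Pod manifest for one CI job that runs under the given RuntimeClass."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"job-{job_id}"},  # hypothetical naming scheme
        "spec": {
            # Selects the sandboxed runtime instead of the cluster default.
            "runtimeClassName": runtime_class_name,
            "restartPolicy": "Never",
            "containers": [{"name": "build", "image": image}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(runtime_class("gvisor", "runsc"), indent=2))
    print(json.dumps(job_pod("1234", "alpine:latest", "gvisor"), indent=2))
```

Swapping the handler for a Kata Containers one would follow the same shape. Whether either runtime gives enough isolation for untrusted jobs is exactly what still needs to be explored.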
Use Terraform to provision machines
Terraform can be used to provision infrastructure. Would it be possible to keep the same scheduler mechanism, but run Terraform instead of Docker Machine commands? The only issue is that we would end up with the same problem as with the executors: everyone wants their own provisioner to be used instead of Terraform.
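One way to picture "same scheduler, different provisioner" is a small interface that the scheduler calls without knowing which tool sits behind it. The sketch below is illustrative only: the `Provisioner` and `Scheduler` classes, the sizing rule, and the Terraform variable `instance_count` are all hypothetical, and the command is only built, not executed. The `terraform apply -auto-approve -var` options themselves are standard Terraform CLI flags.

```python
from dataclasses import dataclass
from typing import List, Protocol


class Provisioner(Protocol):
    """Anything that can turn a desired machine count into a CLI command."""

    def scale_command(self, desired: int) -> List[str]: ...


@dataclass
class TerraformProvisioner:
    # Hypothetical input variable; a real Terraform module would define its own.
    count_var: str = "instance_count"

    def scale_command(self, desired: int) -> List[str]:
        # Build (but do not run) the Terraform invocation for the desired size.
        return ["terraform", "apply", "-auto-approve", "-var", f"{self.count_var}={desired}"]


@dataclass
class Scheduler:
    provisioner: Provisioner
    idle_count: int    # spare machines to keep warm
    max_machines: int  # hard upper bound on machines

    def desired_machines(self, running_jobs: int, pending_jobs: int) -> int:
        # Assumed sizing rule: one machine per job plus the idle buffer, capped.
        wanted = running_jobs + pending_jobs + self.idle_count
        return min(wanted, self.max_machines)

    def plan(self, running_jobs: int, pending_jobs: int) -> List[str]:
        return self.provisioner.scale_command(self.desired_machines(running_jobs, pending_jobs))


if __name__ == "__main__":
    scheduler = Scheduler(TerraformProvisioner(), idle_count=2, max_machines=10)
    print(" ".join(scheduler.plan(running_jobs=3, pending_jobs=1)))
    # prints: terraform apply -auto-approve -var instance_count=6
```

Because the scheduler depends only on `scale_command`, another provisioner could be plugged in the same way, which is also where the "everyone wants their own provisioner" concern shows up.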
Criteria to move away from Docker Machine
- All the benefits of autoscaling
- We need to provide an easy way for users to migrate to the new scheduler/infra provisioner
- GitLab.com shared Runners can use this, with complete isolation between jobs
- GitLab.com shared Runners are heavy users of this and we need to keep that in mind.
- It has to be able to provision Windows machines
- If possible, it should be able to provide macOS machines