Support multiple isolation levels for jobs - none, cgroups, docker, microVM, VM, instance
Description
GitLab Runner has examples for different executors. We provide examples for the Shell, Docker, Docker Machine, Docker Autoscaler, Instance, Kubernetes, SSH, Parallels, and VirtualBox executors. For custom executors, we provide examples for libvirt and LXD. With the Pluggable gRPC executors blueprint (gitlab-com/content-sites/handbook!15068 - closed), we are proposing to make all executors pluggable gRPC executors. This provides extensibility and will supersede custom executors.
MicroVMs provide a stronger isolation and security boundary than a container and are lighter than a full-blown VM. There are many powerful microVM technologies that can be integrated into Kubernetes through Kata Containers.
However, there should be a way to run CI jobs in GitLab on a microVM without needing Kubernetes, not least because there are always some customers who cannot or will not adopt Kubernetes.
Managing Kubernetes is not easy due to operational and maintenance complexity, and the same is true of managing microVMs (e.g. firecracker). Expecting customers to combine the two through Kata Containers in order to run CI jobs in microVMs (for better isolation and security) is, in all likelihood, a tall order. Thus, we (GitLab) should provide more first-class support for microVMs (without needing Kubernetes).
The idea is that the user should have a range of isolation levels that can be applied to jobs:
- none
- cgroups - see Add slot-based cgroup support for Docker executor (gitlab-runner!5870 - merged) • Joe Burnett • 18.6. We can use the "slot" concept to keep jobs from consuming each other's resources, even when they inherently "trust" each other. No container boundary is necessary, so this can work with the instance executor too.
- containers - such as docker and docker+autoscaler
- nested VMs - could be microVMs or actual VMs on a dedicated instance.
- cloud provided VMs - this is running one job per VM and then throwing it away. This is the easiest and strongest isolation, which we currently use in production, managed by our cloud provider.
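To make the idea of a "range" concrete, the levels above can be sketched as an ordered enumeration. This is only an illustration: it assumes the list is ordered from weakest to strongest isolation, and the names `IsolationLevel`, `satisfies`, and `pick_level` are hypothetical and not part of GitLab Runner.

```python
from __future__ import annotations

from enum import IntEnum


class IsolationLevel(IntEnum):
    """Hypothetical ordering of the levels listed above, weakest first (an assumption)."""
    NONE = 0
    CGROUPS = 1
    CONTAINER = 2
    NESTED_VM = 3   # microVMs or full VMs on a dedicated instance
    CLOUD_VM = 4    # one job per cloud-provided VM


def satisfies(offered: IsolationLevel, required: IsolationLevel) -> bool:
    """Return True when the offered level is at least as strong as the required one."""
    return offered >= required


def pick_level(available: set[IsolationLevel], required: IsolationLevel) -> IsolationLevel | None:
    """Pick the weakest available level that still meets the job's requirement."""
    candidates = [level for level in available if satisfies(level, required)]
    return min(candidates) if candidates else None
```

For example, a job that requires at least container isolation on a runner offering only cgroups, nested VMs, and cloud VMs would be placed in a nested VM, while a runner offering only `NONE` could not take a job that requires cgroups.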
The expected outcome of this issue is a set of working examples for microVMs (e.g. firecracker) using fleeting plugins with taskscaler for autoscaling and scheduling multiple jobs on an instance, where each job is a separate microVM. The end goal is to highlight to the customer that they can do this "natively" in GitLab without requiring any third-party tool (e.g. https://actuated.com/). Running CI jobs in microVMs also opens up some other interesting ideas (e.g. snapshotting CI jobs and restarting them in a very short time). This is also useful for other domains at GitLab should they wish to adopt it (e.g. DAP, Workspaces).
Some potential use cases of microVM snapshotting and restoring are:
- Allow users to create a snapshot of a job (just like we allow creating artifacts/cache) through CI syntax.
- Allow users to restore a snapshotted job through CI syntax.
- Efficient/faster matrix testing using snapshot branching through CI syntax. This could be improved further by adding some sort of copy-on-write approach during branching.
- Allow users to time travel for debugging by creating snapshots at specific points of execution. e.g. when a job fails, create a snapshot and provide an option to restore that snapshot using a workspace for easier debugging? Or maybe even restore it locally?
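The snapshot-branching idea for matrix testing can be illustrated with a toy copy-on-write model. This is a conceptual sketch only: it models VM state as a key/value dictionary, and the `Snapshot` class and its methods are hypothetical, not an existing GitLab or microVM API. Each branch stores only its own changes and reads everything else through its parent.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Toy copy-on-write snapshot: holds only the changes made since its parent."""
    name: str
    parent: Snapshot | None = None
    delta: dict[str, str] = field(default_factory=dict)
    sealed: bool = False  # a snapshot becomes read-only once it has been branched

    def branch(self, name: str) -> Snapshot:
        """Create a child that shares this snapshot's state without copying it."""
        self.sealed = True
        return Snapshot(name=name, parent=self)

    def write(self, key: str, value: str) -> None:
        """Record a change in this snapshot only; ancestors are never modified."""
        if self.sealed:
            raise RuntimeError(f"snapshot {self.name!r} is sealed; branch it first")
        self.delta[key] = value

    def read(self, key: str) -> str | None:
        """Look the key up here first, then walk up the parent chain."""
        node: Snapshot | None = self
        while node is not None:
            if key in node.delta:
                return node.delta[key]
            node = node.parent
        return None


# Shared setup runs once, then each matrix entry branches from it.
base = Snapshot("deps-installed")
base.write("dependencies", "installed")
job_a = base.branch("matrix-a")
job_b = base.branch("matrix-b")
job_a.write("runtime", "variant-a")
job_b.write("runtime", "variant-b")
```

Both matrix jobs see the shared `dependencies` state without repeating the setup, while their own writes stay private to each branch.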
Proposal
- Add a new pluggable gRPC executor that executes the prepare/run/cleanup steps in the context of a microVM. A new microVM is created for each job.
- Use the fleeting plugins (aws, azure, gcp, etc.) to autoscale the instances.
- The Taskscaler library is capable of tracking "slots" within a given instance, so it can autoscale a fleet of instances and schedule multiple jobs per instance. It uses a tool called Nesting to manage individual VMs on a dedicated instance in AWS. A reservation from Taskscaler therefore contains both a specific instance and a slot index. Runner will connect to the instance, but instead of running the job right away, it will connect to the nesting gRPC service and request that a VM be created from the desired image. This automatically "stomps" whatever is in that slot. Once the VM is up, runner will hop once more into the VM and run the job.
- The nesting library contains several virtualization providers. Support for libvirt is already added. This would be a logical place to extend support to a given microVM technology.
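The reservation flow described in the proposal can be sketched roughly as follows. This is a simplified in-memory model and not the Taskscaler or Nesting API: `Reservation`, `Instance`, `SlotScheduler`, and `run_job` are hypothetical names, and the gRPC calls and the actual microVM are replaced by plain dictionary updates.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reservation:
    """A reservation names both a specific instance and a slot index on it."""
    instance_id: str
    slot: int


@dataclass
class Instance:
    instance_id: str
    capacity: int                                      # number of slots (VMs) per instance
    busy: set[int] = field(default_factory=set)        # slots currently reserved
    vms: dict[int, str] = field(default_factory=dict)  # slot -> image of the VM in that slot


class SlotScheduler:
    """Hypothetical stand-in for slot tracking across a fleet of instances."""

    def __init__(self, instances: list[Instance]) -> None:
        self.instances = {inst.instance_id: inst for inst in instances}

    def reserve(self) -> Reservation | None:
        """Return the first free (instance, slot) pair, or None if the fleet is full."""
        for inst in self.instances.values():
            for slot in range(inst.capacity):
                if slot not in inst.busy:
                    inst.busy.add(slot)
                    return Reservation(inst.instance_id, slot)
        return None  # a real autoscaler would provision another instance here

    def create_vm(self, res: Reservation, image: str) -> None:
        """Create a VM in the reserved slot, "stomping" whatever was there before."""
        self.instances[res.instance_id].vms[res.slot] = image

    def release(self, res: Reservation) -> None:
        """Free the slot; the old VM stays until the next job stomps it."""
        self.instances[res.instance_id].busy.discard(res.slot)


def run_job(scheduler: SlotScheduler, image: str, script: str) -> str:
    """Prepare (reserve a slot and create the VM), run, then clean up."""
    res = scheduler.reserve()
    if res is None:
        raise RuntimeError("no free slot in the fleet")
    try:
        scheduler.create_vm(res, image)
        # Stand-in for hopping into the VM and executing the job script.
        return f"{res.instance_id}/slot{res.slot}[{image}]: {script}"
    finally:
        scheduler.release(res)


scheduler = SlotScheduler([Instance("instance-1", capacity=2)])
first = run_job(scheduler, "image-a", "make test")
second = run_job(scheduler, "image-b", "make lint")
```

Because the first job releases its slot before the second starts, both land in slot 0 of the same instance, and the second job's VM replaces the first one's.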