Evaluate fleeting-plugin-kubernetes as alternative to kubernetes executor
## Background
Thus far we have been evaluating migrating existing workloads to Kubernetes via the [kubernetes executor](https://docs.gitlab.com/runner/executors/kubernetes/).
## Problem
One of the main challenges we've faced here is incompatibility between the existing `docker+machine` setup and Kubernetes.
Some of these are related to resource allocation, which gVisor and Firecracker mitigate by limiting the guest's view of the host.
However, some issues are due to differing behaviour between the executors, often architectural in nature. Examples include:
- **Differing namespacing model:** Docker creates a separate network namespace per container, so each service gets its own IP and multiple services can bind to the same port. In a Kubernetes Pod, all containers share a single network namespace and IP, so we cannot support e.g. multiple Redis services unless we override the port.
- **Limited access to container metadata:** Docker exposes container information directly, which allows us to `docker inspect` images and make decisions based on that metadata. With Kubernetes, this information is hidden behind the runtime layer, so we either accept that less information is available and pass it explicitly, or re-implement parts of it and introduce a potential layering violation. See also: https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/6037.
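To make the first point concrete, here is a hypothetical `.gitlab-ci.yml` fragment that works under `docker+machine` but cannot work unmodified in a single Pod (image versions and aliases are made up for illustration):

```yaml
# Hypothetical job: under docker+machine each service container gets its
# own network namespace and IP, so two Redis instances can both listen on
# 6379. In a Kubernetes Pod the containers share one network namespace,
# so the second bind on 6379 would fail.
job:
  services:
    - name: redis:6.2
      alias: redis-old
    - name: redis:7.2
      alias: redis-new
  script:
    - echo "redis-old:6379 and redis-new:6379 are both reachable"
```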
There are likely other similar issues lurking.
## Idea
In conversation with @josephburnett, he suggested we take a look at [fleeting](https://docs.gitlab.com/runner/fleet_scaling/fleeting/) with [`fleeting-plugin-kubernetes`](https://gitlab.com/gitlab-org/fleeting/fleeting-plugin-kubernetes). The implementation exists in an MR: https://gitlab.com/gitlab-org/fleeting/fleeting-plugin-kubernetes/-/merge_requests/1.
We would use the [`docker-autoscaler`](https://docs.gitlab.com/runner/executors/docker_autoscaler/) executor with fleeting configured to use the Kubernetes plugin, then configure a `PodTemplate` that exposes the Pod via SSH.
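The runner side might look roughly like the following `config.toml` sketch. The `[runners.autoscaler]` layout follows the documented `docker-autoscaler` executor; the plugin name and the `plugin_config` keys are assumptions, since the actual schema lives in the open MR:

```toml
# Hypothetical config.toml sketch, not taken from the MR.
[[runners]]
  name = "k8s-fleeting-poc"
  url = "https://gitlab.com"
  executor = "docker-autoscaler"

  [runners.docker]
    image = "busybox:latest"

  [runners.autoscaler]
    plugin = "fleeting-plugin-kubernetes"  # assumed plugin binary name
    capacity_per_instance = 1
    max_use_count = 1
    max_instances = 10

    [runners.autoscaler.plugin_config]
      # Assumed plugin-specific settings: the namespace to create Pods in
      # and a reference to the PodTemplate that runs sshd + dockerd.
      namespace = "ci-fleeting"
      pod_template = "fleeting-instance"
```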
The runner would then provision a new Pod running both an SSH server and a Docker daemon. We connect over SSH, reach the Docker daemon via `dial-stdio`, and hand that connection off to the `docker` executor.
Something along these lines:
```
┌─────────────────┐ SSH tunnel ┌──────────────────────────┐
│ GitLab Runner │ ──────────────────► │ Fleeting Instance │
│ │ │ (VM or Pod) │
│ docker- │ Docker API │ ┌──────────────────┐ │
│ autoscaler │ ──────────────────► │ │ Docker Daemon │ │
│ │ (tunneled) │ │ │ │
│ docker │ │ │ ┌────────────┐ │ │
│ executor │ │ │ │ Job │ │ │
└─────────────────┘ │ │ │ Container │ │ │
│ │ └────────────┘ │ │
│ └──────────────────┘ │
└──────────────────────────┘
```
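The `PodTemplate` mentioned above might look roughly like this. Image choices are placeholders, and whether the plugin consumes a `PodTemplate` object in exactly this form is an assumption:

```yaml
# Hypothetical PodTemplate sketch; image names and the exact schema the
# plugin expects are assumptions.
apiVersion: v1
kind: PodTemplate
metadata:
  name: fleeting-instance
template:
  spec:
    containers:
      - name: dockerd
        image: docker:dind          # Docker-in-Docker daemon
        securityContext:
          privileged: true          # dind typically requires privileged mode
      - name: sshd
        image: example/sshd:latest  # placeholder SSH server image
        ports:
          - containerPort: 22
```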
There may be some room to change this interface, though:
- We could enable connecting via Kubernetes exec directly instead of SSH
- We could enable Kubernetes-level port-forwarding instead of SSH
- The docker client could then use [`dial-stdio`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/16ba4945040d02beb7c9560cd3265240ee9290a0/executors/docker/docker.go#L1221) over exec to talk to the docker daemon.
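For illustration, the two transports could be exercised manually along these lines. The Pod name and ports are hypothetical, and this assumes a cluster with such a Pod already running:

```shell
# Hypothetical manual smoke test; assumes a Pod "fleeting-abc123" running
# sshd on port 22 and a Docker daemon.

# SSH transport (what docker-autoscaler uses today): the docker CLI
# invokes `docker system dial-stdio` on the remote end.
kubectl port-forward pod/fleeting-abc123 2222:22 &
docker -H ssh://root@localhost:2222 info

# Kubernetes-native alternative: port-forward straight to the daemon,
# skipping SSH entirely (requires the daemon to listen on TCP).
kubectl port-forward pod/fleeting-abc123 2375:2375 &
DOCKER_HOST=tcp://localhost:2375 docker info
```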
## Analysis
Pro:
- Allows an execution environment closer to the existing `docker+machine` setup, which may reduce divergence and increase compatibility.
- May offer the best of both worlds: the compatibility of Docker and the scheduling and sizing flexibility of Kubernetes, see also: https://gitlab.com/gitlab-com/runner-group/team-tasks/-/work_items/437.
- Buying into the fleeting abstraction enables portability across providers (though arguably buying into Kubernetes does the same).
- We get to test/dogfood fleeting at scale, and gain early access to features being integrated in fleeting first, such as steps/functions.
Con:
- Immature: the implementation exists in a 2+ year old open MR.
- Diverges from the Kubernetes executor (one of the most popular ways of deploying runners).
- Moves the impedance mismatch elsewhere: we become more reliant on SSH+Docker as an interface.
- Introduces an additional Docker (and SSH) layer, which may have compatibility, stability, performance, debugging, and maintenance implications.
## Decision
We need to decide:
- Is this approach technically feasible?
- Does this approach make sense for our Hosted Runner goals?
- Does this overall direction make sense?
- Does this align with other dogfooding goals, e.g. improving the Kubernetes executor?
- If we decide to go with this, what changes are needed to support our use case? If we don't, how does that impact our story for kubernetes executor adoption?
cc @ayeung @rehab @kkyrala @tmaczukin @josephburnett