
Draft: MVC Add fault tolerance with file storage

Georgi N. Georgiev requested to merge fault-tolerant-runner-hackaton into main

Problem to Solve

One issue GitLab Runner has is its lack of fault tolerance. This is particularly painful in an environment that is inherently fault tolerant, such as Kubernetes. With the Kubernetes executor being used more and more, this becomes an increasingly glaring issue.

For the context of this MVC, let's only think in terms of a Kubernetes environment and the Kubernetes executor, since in my mind that's the primary target demographic for this functionality.

Example:

GitLab Runner runs a job in Kubernetes. GitLab Runner runs in one Pod, while the job runs in another Pod. The Runner monitors the job's logs and infers its state from them, switching between the appropriate stages. At some point the Runner Pod gets killed by Kubernetes for whatever reason, for example the Node is restarted, so the Pods running on it need to be evacuated to another Node. When the Runner Pod is brought back up, it has zero knowledge of what the previous Pod was doing before it was shut down. It doesn't know of any jobs that were started and are most likely still running somewhere on the cluster. These jobs are now not monitored by anything, so they keep running unchecked: their logs are not sent to GitLab, their status is not reported, and they'll sit perpetually in a Running state until GitLab decides they need to time out.

We introduced gitlab-runner-pod-cleanup to solve the issue of leftover resources, however that is only a band-aid fix. In my mind, in the above example GitLab Runner should be able to pick up the job where it left off after it comes back up, update the logs and the state of the job in GitLab, and make sure to clean up any leftover resources, just as if it had never been restarted in the first place.

Action plan:

Only the Attach strategy should support this

  1. Change GitLab Runner's most basic internal handling of jobs. It currently has no concept of resuming a job; its memory model should be adjusted so that it can pick up existing jobs (see the file-store sketch after this list for what the persisted job state could look like).
  2. Change GitLab Runner's Kubernetes model to be able to reattach to a Running pod. With the attach strategy that is already possible. What it's missing is the ability to repopulate the executor's knowledge of already existing objects: pod, secrets, volumes, etc. We should fetch all of these from Kubernetes, update the Kubernetes executor's knowledge of them and reattach to the log of the pod. From there the log processor should pick it up and report back the status of the job (a rough reattach sketch also follows the list).
  3. With Runners picking up already Running jobs there's the problem of contention. We should not allow two Runner instances to pick up the same job. This can be done with a simple Redis or etcd interface. It could even be a file-based database for the MVC, it doesn't really matter. Reading this you might think this is not needed and we could just run one Runner for the MVC, and you would be absolutely right, however:
  4. We need a place to store all the jobs and the resources created by them. We could query the Kubernetes cluster, filter by labels, etc., but I think that's quite limiting and cumbersome, and it's also not portable across executors. We could instead use the store of our choice to record all the resources we created, then use that information to give starting Runners knowledge about these resources in a clean and predictable manner.
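To make steps 1, 3 and 4 a bit more concrete, here's a minimal sketch of what a file-backed job store for the MVC could look like. Everything in it is hypothetical: these types and functions don't exist in GitLab Runner today, and a Redis or etcd client could later replace the file backend behind the same interface.

```go
// Hypothetical file-backed job store for the MVC; all names are illustrative.
package store

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// JobState is the minimal information a restarted Runner would need to
// resume monitoring a job started by a previous instance.
type JobState struct {
	JobID     int64    `json:"job_id"`
	Token     string   `json:"token"`      // job token used to update GitLab
	Namespace string   `json:"namespace"`  // Kubernetes namespace of the build pod
	PodName   string   `json:"pod_name"`   // build pod to reattach to
	Secrets   []string `json:"secrets"`    // leftover resources to clean up later
	LogOffset int64    `json:"log_offset"` // how much of the trace was already sent
}

// FileStore persists one JSON file per job under a shared directory
// (for example a PersistentVolume mounted into the Runner pod).
type FileStore struct{ dir string }

func NewFileStore(dir string) (*FileStore, error) {
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return nil, err
	}
	return &FileStore{dir: dir}, nil
}

func (s *FileStore) path(jobID int64) string {
	return filepath.Join(s.dir, fmt.Sprintf("job-%d.json", jobID))
}

// Save writes the job state atomically so a crash mid-write cannot corrupt it.
func (s *FileStore) Save(state JobState) error {
	data, err := json.Marshal(state)
	if err != nil {
		return err
	}
	tmp := s.path(state.JobID) + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, s.path(state.JobID))
}

// Claim marks a job as owned by one Runner instance. O_EXCL makes the claim
// atomic on a local filesystem, which is enough contention handling for an
// MVC; Redis or etcd leases would replace this later.
func (s *FileStore) Claim(jobID int64, runnerID string) error {
	f, err := os.OpenFile(s.path(jobID)+".lock", os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if errors.Is(err, os.ErrExist) {
		return fmt.Errorf("job %d already claimed by another runner", jobID)
	}
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.WriteString(runnerID)
	return err
}

// List returns the states of all jobs recorded before a restart.
func (s *FileStore) List() ([]JobState, error) {
	matches, err := filepath.Glob(filepath.Join(s.dir, "job-*.json"))
	if err != nil {
		return nil, err
	}
	var states []JobState
	for _, m := range matches {
		data, err := os.ReadFile(m)
		if err != nil {
			return nil, err
		}
		var st JobState
		if err := json.Unmarshal(data, &st); err != nil {
			return nil, err
		}
		states = append(states, st)
	}
	return states, nil
}
```

On startup, a restarted Runner would List() the recorded jobs, Claim() each one so no other instance touches it, and hand the state back to the executor. For step 2, here's an equally rough client-go sketch of the reattach side. It assumes the executor labelled everything it created with the job ID (the label key below is made up), and it uses GetLogs only as a stand-in for re-following the job's output; the real attach strategy reads the trace from inside the build container, so treat this purely as an illustration.

```go
// Hypothetical sketch for step 2; none of this is existing Runner code.
package reattach

import (
	"context"
	"fmt"
	"io"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Recover repopulates a restarted Runner's view of a job's Kubernetes
// resources and resumes following the job's output so the log processor
// can pick it back up and report the job's state to GitLab.
func Recover(ctx context.Context, c kubernetes.Interface, ns string, jobID int64) (io.ReadCloser, error) {
	// Assumed label key; the real executor's labels may differ.
	sel := metav1.ListOptions{LabelSelector: fmt.Sprintf("runner.gitlab.com/job-id=%d", jobID)}

	pods, err := c.CoreV1().Pods(ns).List(ctx, sel)
	if err != nil {
		return nil, err
	}
	if len(pods.Items) == 0 {
		return nil, fmt.Errorf("no pod found for job %d; it may already be gone", jobID)
	}
	pod := pods.Items[0]

	// Secrets (and likewise config maps, volumes, etc.) would be re-listed
	// the same way and handed back to the executor so that cleanup still
	// works after a restart.
	if _, err := c.CoreV1().Secrets(ns).List(ctx, sel); err != nil {
		return nil, err
	}

	// Follow the build container's output. The real attach strategy tails a
	// trace file inside the pod instead; GetLogs is just the simplest
	// illustration of "resume watching the job".
	opts := &corev1.PodLogOptions{Container: "build", Follow: true}
	return c.CoreV1().Pods(ns).GetLogs(pod.Name, opts).Stream(ctx)
}
```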

Why is this important?

Kubernetes is an important part of the future of GitLab Runner and without fault tolerance we can never say we truly support it.

Requested expertise

GitLab Runner, GitLab API, Kubernetes, Redis (etcd?, sqlite?), Go

Effort expected

5 days for a hacked-no-tests-glued-with-gum MVC sounds reasonable

Artifacts to be produced

1 MR with all the changes needed to give a baseline for the future.

Also a video recording of a demo

FAQs

  • Q: Are you sure this can actually work?
  • A: That's a chrome-plated shmaybe at most from me.
  • Q: What's the meaning of life?
  • A: To make GitLab Runner fault tolerant and distributed (a bit of a spoiler, eh).

#36951

