Generic job API to run compute tasks

Problem to solve

About a month ago (i.e. before the blog post came out), a friend of mine, @Malinskiy, and I were discussing the same idea that is expressed in the "Modern CI is Too Complex and Misdirected" blog post. Quite a coincidence! The point of our discussion was that it’d be awesome if GitLab exposed a generic job scheduling API (I'll refer to it as the "job API" below) that third-party software (itself running in a CI job or not) could use to execute tasks.

As a concrete use case, @Malinskiy is building Marathon, a test scheduling framework for mobile tests (and not only those), which could benefit from such an API. The number of tasks is not known in advance and may even change during execution (e.g. when failed tasks are re-executed).
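To make this use case a bit more concrete, here is a minimal Python sketch of a client that does not know its final number of tasks up front, because failed tasks are resubmitted. Everything in it is an assumption made for illustration: no such job API exists today, and `InMemoryJobAPI`, `submit`, `wait_any`, `pending` and `run_with_retries` are hypothetical names. The "API" here just runs tasks locally through a callable, so the example stays self-contained.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Tuple


@dataclass
class InMemoryJobAPI:
    """Hypothetical stand-in for the proposed job API; runs tasks locally."""

    run: Callable[[str], bool]  # executes a task, returns True on success
    _queue: Deque[Tuple[int, str]] = field(default_factory=deque, init=False)
    _next_id: int = field(default=0, init=False)

    def submit(self, task: str) -> int:
        """Queue a task and return its job id."""
        job_id = self._next_id
        self._next_id += 1
        self._queue.append((job_id, task))
        return job_id

    def wait_any(self) -> Tuple[int, str, bool]:
        """Run the oldest queued job and report (job id, task, success)."""
        job_id, task = self._queue.popleft()
        return job_id, task, self.run(task)

    def pending(self) -> int:
        """Number of jobs that are queued but not yet finished."""
        return len(self._queue)


def run_with_retries(
    api: InMemoryJobAPI, tasks: List[str], max_attempts: int = 2
) -> Dict[str, bool]:
    """Submit all tasks, resubmitting failed ones until max_attempts is reached."""
    attempts = {task: 1 for task in tasks}
    results: Dict[str, bool] = {}
    for task in tasks:
        api.submit(task)
    while api.pending():
        _, task, ok = api.wait_any()
        if ok or attempts[task] >= max_attempts:
            results[task] = ok
        else:
            # The total job count grows here, which is why it is unknown in advance.
            attempts[task] += 1
            api.submit(task)
    return results


calls: List[str] = []


def flaky_runner(task: str) -> bool:
    """Fake executor: test_b fails on its first attempt only, test_c always fails."""
    calls.append(task)
    if task == "test_c":
        return False
    return not (task == "test_b" and calls.count(task) == 1)


api = InMemoryJobAPI(run=flaky_runner)
results = run_with_retries(api, ["test_a", "test_b", "test_c"])
print(results)
```

Three tasks were requested, but five jobs end up being scheduled. A fixed, pre-declared job graph cannot express that kind of growth, while a submit-as-you-go API can.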

With such a job API we could have an ecosystem of tools that work with it and build on top of it, similar to the ecosystem around GitHub Actions that the blog post author mentions. That is, a GitLab instance user or owner could use third-party software that utilizes this job API.

From the customer’s perspective, they would need to install a bunch of Runners on their own hardware or cloud, and the job API would schedule work on them. Ideally, autoscaling should be taken care of too.

I agree with the blog post: CI can be viewed as one way to utilize a generic "remote code execution as a service" API. But there would be more ways to use the compute if there were an API for it. I imagine it as a layered cake, with the job API at the bottom and various layers that build on top of and alongside each other:

job API <- Bazel’s remote execution API <- used by Bazel (see below)
job API <- GitLab CI <- used by existing customers
job API <- Marathon scheduler <- used by users running their tests
job API <- advanced customers dynamically scheduling tasks from their CI jobs
etc.
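The layering above can be sketched roughly as follows, assuming a deliberately tiny interface with a single `submit` call. All names here (`JobAPI`, `RecordingJobAPI`, `run_pipeline`, `run_sharded_tests`) are hypothetical and are not existing GitLab, Bazel or Marathon APIs. The only point is that a CI-like layer and a test-scheduler-like layer can sit side by side on the same bottom layer.

```python
from typing import Dict, List, Protocol


class JobAPI(Protocol):
    """Hypothetical bottom layer: accepts a command and returns a job id."""

    def submit(self, command: str) -> int: ...


class RecordingJobAPI:
    """Toy implementation that only records what it was asked to run."""

    def __init__(self) -> None:
        self.commands: List[str] = []

    def submit(self, command: str) -> int:
        self.commands.append(command)
        return len(self.commands) - 1


def run_pipeline(api: JobAPI, stages: Dict[str, List[str]]) -> List[int]:
    """CI-like layer: one job per script line, stage by stage."""
    return [
        api.submit(f"[{stage}] {script}")
        for stage, scripts in stages.items()
        for script in scripts
    ]


def run_sharded_tests(api: JobAPI, tests: List[str], shards: int) -> List[int]:
    """Test-scheduler-like layer: split tests round-robin into shards."""
    buckets = [tests[i::shards] for i in range(shards)]
    return [api.submit("run-tests " + " ".join(bucket)) for bucket in buckets if bucket]


api = RecordingJobAPI()
pipeline_ids = run_pipeline(api, {"build": ["make"], "test": ["make check"]})
shard_ids = run_sharded_tests(api, ["t1", "t2", "t3"], shards=2)
print(api.commands)
```

Neither layer knows about the other; both depend only on `submit`. That is the property that would let third-party tools build on the job API without going through CI at all.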

First-class Bazel support could also be built on top of that job API (if we want to compete with existing options):

provide a remote execution API
support parsing the Build Event Protocol to display rich, real-time information about what’s happening in the build
offer coverage integration
