Draft: Add support to upload a CI job artifact to its final location on Object Storage
⚠ About this MR
This MR is a very quick PoC to deal with #285597 (closed).
As such, code quality on this MR is not great (no tests, rubocop violations).
💼 Summary
We can avoid the object storage copy on CI job artifacts (#285597 (closed)) without creating a CI job artifact row during the /authorize
request.
This can be done by storing all the fields needed for the object storage key in a side table (a "pending" table). Among those fields, we have the CI job artifact id. This id can be read and stored in the side table by reading the next value of the primary key sequence.
🔭 Context
CI job artifact uploads are powered by direct uploads.
In simplified terms, Workhorse detects the upload and asks the Rails backend where to put it. Workhorse then uploads the file to that location and calls Rails again to "finalize" the upload.
In a very simple summary:
```mermaid
sequenceDiagram
  autonumber
  Runner->>Workhorse: Here is a CI job artifact file.
  Workhorse->>Rails: Hello there! I have a file upload which is a CI job artifact, where do I put it?
  Rails->>Workhorse: Sure, put it here (X).
  Workhorse->>"Object Storage": File uploaded to (X).
  "Object Storage"->>Workhorse: Upload accepted and completed!
  Workhorse->>Rails: Upload done! It's here (X).
  Rails->>Rails: Process the uploaded file
  Rails->>Workhorse: Ok, all good.
  Workhorse->>Runner: Ok, all good.
```
As we can see, this is quite an involved process.
In the current implementation, if we zoom in on interaction (7), the file is actually moved on Object Storage for technical reasons. In other words, Rails moves the file on Object Storage from one location (temporary location) to another location (final location).
This move in Object Storage is neither free nor instantaneous. With large files, we are observing issues, see #285597 (closed).
🚒 Solution
In the discussions of #285597 (closed), the following idea came up: can we upload the CI job artifact file directly to its final location? By doing so, we would avoid the object storage copy.
The main challenge of this solution is the final location key ((X)). This key is not a random one, it's actually a function that depends on:
- Job `id`.
- Project `id`.
- Job artifact `id`.
- Job artifact `created_at`.
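As an illustration, the final key can be thought of as a pure function of those four values. The path layout below is hypothetical (it is not GitLab's actual key layout); it only shows why all four fields must be known before the upload starts:

```ruby
# Hypothetical sketch: derive an object storage key from the four fields.
# The path layout here is illustrative only, NOT GitLab's real layout.
def final_artifact_key(project_id:, job_id:, artifact_id:, created_at:)
  date_part = created_at.utc.strftime("%Y_%m_%d")
  "#{project_id}/#{date_part}/#{job_id}/#{artifact_id}"
end

key = final_artifact_key(
  project_id: 42,
  job_id: 1001,
  artifact_id: 555,
  created_at: Time.utc(2022, 12, 1)
)
# key == "42/2022_12_01/1001/555"
```

The point is that the last two inputs belong to a row that, today, does not exist yet when the key is needed.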
The problem is that this key needs to be known when Workhorse asks Rails for the location (interaction (3.)). In the current implementation, the project and job ids are known at that point, but the job artifact data is not there (yet): it is created later, during interaction (7.).
How do we solve this?
Well, in !105074 (closed), the approach explored is to create a "pending" CI job artifact row. This way, we can create it during (3.) and use it to compute the final object storage key.
The downside of this approach: it creates what we can call "pending" CI job artifacts that can be left behind. Imagine an issue occurring between (3.) and (7.) so that only (3.) is executed = we now have stale CI job artifacts that need to be cleaned up (probably with a background job). In addition, the current queries on the ci job artifacts table need to be updated so that they exclude "pending" CI job artifacts.
🤔 What If?
Given that ci job artifacts is a high traffic table, the thought of having "pending" rows that need to be excluded in all read queries was not super attractive.
Can we do better?
Well, the whole problem is having the job artifact `id` + `created_at` at the authorize request, interaction (3.). What if we generated them in advance, stored them, and, when we want to create the CI job artifact row, simply read them and set them on the row (during interaction (7.))?
That's easy for the job artifact `created_at`. That's a "standard" value that we can save in one table, read back, and set on the ci job artifacts row. Pretty straightforward.
Not so much for the job artifact `id`. That's the primary key; we can't simply make a value up here.
I think there is a way to get the primary key in advance. Nothing magical here: the idea is to simply read the related primary key sequence to get the next available id.
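In PostgreSQL terms this boils down to a single sequence read, something like `SELECT nextval('ci_job_artifacts_id_seq')` (sequence name assumed here, check the actual schema). The runnable sketch below simulates that behaviour with an in-memory counter, to show why ids handed out this way can never collide with later inserts:

```ruby
# Minimal simulation of PostgreSQL's nextval(): each call returns a fresh,
# monotonically increasing id, even across concurrent callers.
# In Rails this would be roughly:
#   ActiveRecord::Base.connection
#     .select_value("SELECT nextval('ci_job_artifacts_id_seq')")
# (sequence name assumed).
class FakeSequence
  def initialize(start = 1)
    @next = start
    @mutex = Mutex.new
  end

  def nextval
    @mutex.synchronize do
      value = @next
      @next += 1
      value
    end
  end
end

seq = FakeSequence.new(100)
reserved_id = seq.nextval # id reserved during /authorize (interaction 3)
later_id    = seq.nextval # any insert happening afterwards
# reserved_id == 100, later_id == 101: the reserved id is never reused
```

Just like the real sequence, ids taken this way are "burned" even if never used, which is harmless (gaps in primary keys are normal).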
In other words, the idea here is to have a "pending" ci job artifacts table.
- On (3.), we insert a new row in this "pending" table with all the values needed for the key. Among other things, we get the next available primary key.
- On (7.), we can simply move the values from this "pending" table to the final ci job artifact row. As a bonus, we can delete the row in the "pending" table as it is not used anymore.
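The two interactions above can be sketched end to end as follows, with plain Ruby hashes standing in for the two tables (all names are illustrative, not the actual implementation):

```ruby
# In-memory stand-ins for the "pending" table and the real table.
PENDING_UPLOADS = {} # keyed by the reserved artifact id
JOB_ARTIFACTS   = {}

$artifact_id_seq = 0 # stand-in for the primary key sequence

# Interaction (3): reserve an id + created_at and persist them as "pending".
def authorize_upload(project_id:, job_id:)
  artifact_id = ($artifact_id_seq += 1) # nextval() equivalent
  PENDING_UPLOADS[artifact_id] = {
    project_id: project_id,
    job_id: job_id,
    created_at: Time.now.utc
  }
  artifact_id
end

# Interaction (7): move the pending values onto the real row and
# delete the pending row, which is no longer needed.
def finalize_upload(artifact_id)
  pending = PENDING_UPLOADS.delete(artifact_id) or raise "no pending upload"
  JOB_ARTIFACTS[artifact_id] = pending
end

id = authorize_upload(project_id: 42, job_id: 1001)
finalize_upload(id)
# PENDING_UPLOADS is empty again; JOB_ARTIFACTS holds the row under the reserved id.
```

If something fails between (3.) and (7.), only a row in the pending table is left behind; the real table never sees it.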
The main benefits of the above are:
- We don't create stale CI job artifacts at all = all existing queries/indexes are untouched.
- We still respect the database constraints, in particular with regard to the primary key.
- The "pending" table is limited in size because every successful upload removes its row from that table.
- We can still have stale rows in the "pending" table. Those will need to be cleaned up. Per the previous point, the table should stay quite small.
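The cleanup mentioned in the last point could be as simple as periodically deleting pending rows older than some threshold. A runnable sketch, with in-memory rows standing in for the table (names and threshold are illustrative):

```ruby
# Sketch of a stale-pending cleanup: drop rows older than a cutoff.
# In Rails this would likely be a scheduled worker doing something like
#   PendingUpload.where("created_at < ?", cutoff).delete_all
# (model name assumed).
def cleanup_stale!(pending_rows, older_than:, now: Time.now)
  cutoff = now - older_than
  pending_rows.reject! { |row| row[:created_at] < cutoff }
  pending_rows
end

now = Time.utc(2022, 12, 1, 12, 0, 0)
rows = [
  { id: 1, created_at: now - 7200 }, # 2 hours old: stale, dropped
  { id: 2, created_at: now - 60 }    # 1 minute old: kept
]
cleanup_stale!(rows, older_than: 3600, now: now)
# rows now only contains the fresh entry (id: 2)
```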
🏗 This MR
This MR is basically a PoC for the idea above:
- Create a `ci_job_artifact_pending_uploads` table with columns for the project id, job id, job artifact id, remote id, `created_at` and `updated_at`.
- Update the authorize and finalize actions on the CI job artifact uploads so that the pending row is created during authorize and read back for the finalize action.
- Update the object storage logic so that clients (in this case the CI job artifact upload endpoints) can send the final location that the file needs to be at on Object Storage.
- Gate the changes behind a feature flag.
🍿 Demo time
The setup is a very simple one:
- Object storage setup for Minio.
- Consolidated configuration not used.
- A project with the following `.gitlab-ci.yml`:

```yaml
default:
  image: alpine:latest

txt:
  script: echo "This is a sample text" > bananas.txt
  artifacts:
    paths:
      - bananas.txt
    expire_in: 1 week
```
⛳ Without the feature flag enabled, aka current situation
We run a pipeline and here is what minio receives as requests:
We can observe:
- an `s3.CopyObject` request is triggered. 👈 That's the one we want to avoid.
- multiple `s3.DeleteObject` requests are triggered.
- In total, we have `13` object storage interactions (and that's for a single tiny text file).
⛳ With the feature flag enabled, aka this MR.
Again, we run a pipeline and here is what minio receives:
We can see:
- no `s3.CopyObject` request triggered. 🎉
- no `s3.DeleteObject` requests triggered. 🎉
- In total, we have `8` object storage interactions (that's a ~38% reduction).
🚧 What this MR needs to be production ready
- A database review; we need more 👀 around the insert with the primary key sequence read.
- Test this MR under different object storage configurations:
  - object storage disabled (filesystem).
  - consolidated configuration.
- Improve the changes here:
  - The changes in `app/services/ci/job_artifacts/create_service.rb` need a refactoring.
  - We need a transaction in the finalize request (7.). Currently this transaction contains too many function calls; we should reduce it to just the CI job artifact row insert and the pending CI job artifact removal.
  - A `find_by` is used. We need a model scope.
🔮 Possible follow ups
- Investigate all those `GET` requests that we see on minio and try to remove them.
- Refactor/improve the solution so that it can be used for other uploads.
  - We have #348959 in this direction.
- Up for another crazy idea? Given that the "pending" table becomes "temporary" storage (between 3. and 7.), we could use a different storage for it. One good candidate could be Redis. Why Redis? Because we can have expiring keys = we don't need a background job to clean up stale pending data. Redis will automatically remove them for us 😸
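With the redis-rb client that would be roughly `redis.set("pending_artifact:#{id}", payload, ex: 3600)` (key name and TTL are illustrative). The runnable sketch below mimics that expiring-key behaviour in plain Ruby, to show how stale pending data would clean itself up:

```ruby
# Plain-Ruby mimic of Redis expiring keys (SET key value EX ttl).
# Real Redis expires keys for us; here we expire lazily on read.
class ExpiringStore
  Entry = Struct.new(:value, :expires_at)

  def initialize(clock: -> { Time.now })
    @clock = clock
    @data = {}
  end

  def set(key, value, ex:)
    @data[key] = Entry.new(value, @clock.call + ex)
  end

  def get(key)
    entry = @data[key]
    return nil unless entry
    if @clock.call >= entry.expires_at
      @data.delete(key) # TTL elapsed: the entry is gone, like in Redis
      nil
    else
      entry.value
    end
  end
end

now = Time.utc(2022, 12, 1)
store = ExpiringStore.new(clock: -> { now })
store.set("pending_artifact:555", "payload", ex: 3600)
before = store.get("pending_artifact:555") # readable before the TTL elapses
now += 3601
after = store.get("pending_artifact:555")  # nil afterwards, no cleanup job needed
```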