Draft: Add support to upload a CI job artifact to its final location on Object Storage
⚠ About this MR
This MR is a very quick PoC to deal with #285597 (closed).
As such, code quality on this MR is not great (no tests, rubocop violations).
💼 Summary
We can avoid the object storage copy on CI job artifacts (#285597 (closed)) without creating a CI job artifact row during the /authorize
request.
This can be done by storing all the fields needed for the object storage key in a side table (a "pending" table). Among those fields, we have the CI job artifact id. This id can be read and stored in the side table by reading the next value of the primary key sequence.
🔭 Context
CI job artifact uploads are powered by direct uploads.
In simplified terms, Workhorse detects the upload and asks the Rails backend where to put it. Workhorse then uploads the file to that location and calls Rails again to "finalize" the upload.
In a very simple summary:
```mermaid
sequenceDiagram
  autonumber
  Runner->>Workhorse: Here is a CI job artifact file.
  Workhorse->>Rails: Hello there! I have a file upload which is a CI job artifact, where do I put it?
  Rails->>Workhorse: Sure, put it here (X).
  Workhorse->>"Object Storage": File uploaded to (X).
  "Object Storage"->>Workhorse: Upload accepted and completed!
  Workhorse->>Rails: Upload done! It's here (X).
  Rails->>Rails: Process the uploaded file
  Rails->>Workhorse: Ok, all good.
  Workhorse->>Runner: Ok, all good.
```
As we can see, this is quite an involved process.
In the current implementation, if we zoom in on interaction (7), the file is actually moved on Object Storage for technical reasons. In other words, Rails moves the file on Object Storage from one location (temporary location) to another location (final location).
This move in Object Storage is neither free nor instantaneous. With large files, we are observing issues, see #285597 (closed).
🚒 Solution
In the discussions of #285597 (closed), the following idea came up: can we upload the CI job artifact file directly to its final location? By doing so, we would avoid the object storage copy.
The main challenge of this solution is the final location key ((X)). This key is not a random one, it's actually a function that depends on:
- Job `id`.
- Project `id`.
- Job artifact `id`.
- Job artifact `created_at`.
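As an illustration, the final key can be thought of as a pure function of those four values. The path layout below is hypothetical (it is not GitLab's actual key layout); it only shows why all four fields must be known before the upload starts:

```ruby
# Hypothetical sketch: derive an object storage key from the four fields.
# The path layout here is illustrative only, NOT GitLab's real layout.
def final_artifact_key(project_id:, job_id:, artifact_id:, created_at:)
  date_part = created_at.utc.strftime("%Y_%m_%d")
  "#{project_id}/#{date_part}/#{job_id}/#{artifact_id}"
end

key = final_artifact_key(
  project_id: 42,
  job_id: 1001,
  artifact_id: 555,
  created_at: Time.utc(2022, 12, 1)
)
# key == "42/2022_12_01/1001/555"
```

The point is that the last two inputs belong to a row that, today, does not exist yet when the key is needed.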
The problem is that this key needs to be known when Workhorse asks Rails for the location (interaction (3.)). In the current implementation, the project and job ids are known at that point, but the job artifact data is not there (yet): it is created later, during interaction (7.).
How do we solve this?
Well, in !105074 (closed), the approach explored is to create a "pending" CI job artifact row. This way, we can create it during (3.) and use it to compute the final object storage key.
The downside of this approach: it creates what we can call "pending" CI job artifacts that can be left behind. Imagine an issue occurring between (3.) and (7.) so that only (3.) is executed = we now have stale CI job artifacts that need to be cleaned up (probably with a background job). In addition, the current queries on the ci job artifacts table need to be updated so that they exclude "pending" CI job artifacts.
🤔 What If?
Given that ci job artifacts is a high traffic table, the thought of having "pending" rows that need to be excluded in all read queries was not super attractive.
Can we do better?
Well, the whole problem is having the job artifact `id` + `created_at` at the authorize request, interaction (3.). What if we generated them in advance, stored them, and, when we want to create the CI job artifact row, simply read them and set them on the row (during interaction (7.))?
That's easy for the job artifact `created_at`. That's a "standard" value that we can save in one table, read back, and set on the ci job artifacts row. Pretty straightforward.
Not so much for the job artifact `id`. That's the primary key; we can't simply make a value up here.
I think there is a way to get the primary key in advance. Nothing magical here: the idea is to simply read the related primary key sequence to get the next available id.
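In PostgreSQL terms this boils down to a single sequence read, something like `SELECT nextval('ci_job_artifacts_id_seq')` (sequence name assumed here, check the actual schema). The runnable sketch below simulates that behaviour with an in-memory counter, to show why ids handed out this way can never collide with later inserts:

```ruby
# Minimal simulation of PostgreSQL's nextval(): each call returns a fresh,
# monotonically increasing id, even across concurrent callers.
# In Rails this would be roughly:
#   ActiveRecord::Base.connection
#     .select_value("SELECT nextval('ci_job_artifacts_id_seq')")
# (sequence name assumed).
class FakeSequence
  def initialize(start = 1)
    @next = start
    @mutex = Mutex.new
  end

  def nextval
    @mutex.synchronize do
      value = @next
      @next += 1
      value
    end
  end
end

seq = FakeSequence.new(100)
reserved_id = seq.nextval # id reserved during /authorize (interaction 3)
later_id    = seq.nextval # any insert happening afterwards
# reserved_id == 100, later_id == 101: the reserved id is never reused
```

Just like the real sequence, ids taken this way are "burned" even if never used, which is harmless (gaps in primary keys are normal).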
In other words, the idea here is to have a "pending" ci job artifacts table.
- On (3.), we insert a new row in this "pending" table with all the values needed for the key. Among other things, we get the next available primary key.
- On (7.), we can simply move the values from this "pending" table to the final ci job artifact row. As a bonus, we can delete the row in the "pending" table as it is not used anymore.
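The two interactions above can be sketched end to end as follows, with plain Ruby hashes standing in for the two tables (all names are illustrative, not the actual implementation):

```ruby
# In-memory stand-ins for the "pending" table and the real table.
PENDING_UPLOADS = {} # keyed by the reserved artifact id
JOB_ARTIFACTS   = {}

$artifact_id_seq = 0 # stand-in for the primary key sequence

# Interaction (3): reserve an id + created_at and persist them as "pending".
def authorize_upload(project_id:, job_id:)
  artifact_id = ($artifact_id_seq += 1) # nextval() equivalent
  PENDING_UPLOADS[artifact_id] = {
    project_id: project_id,
    job_id: job_id,
    created_at: Time.now.utc
  }
  artifact_id
end

# Interaction (7): move the pending values onto the real row and
# delete the pending row, which is no longer needed.
def finalize_upload(artifact_id)
  pending = PENDING_UPLOADS.delete(artifact_id) or raise "no pending upload"
  JOB_ARTIFACTS[artifact_id] = pending
end

id = authorize_upload(project_id: 42, job_id: 1001)
finalize_upload(id)
# PENDING_UPLOADS is empty again; JOB_ARTIFACTS holds the row under the reserved id.
```

If something fails between (3.) and (7.), only a row in the pending table is left behind; the real table never sees it.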
The main benefits of the above are:
- We don't create stale CI job artifacts at all = all existing queries/indexes are untouched.
- We still respect the database constraints, in particular with regard to the primary key.
- The "pending" table is limited in size because every successful upload removes its row from that table.
- We can still have stale rows in the "pending" table. Those will need to be cleaned up. Per the previous point, the table should stay quite small.
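The cleanup mentioned in the last point could be as simple as periodically deleting pending rows older than some threshold. A runnable sketch, with in-memory rows standing in for the table (names and threshold are illustrative):

```ruby
# Sketch of a stale-pending cleanup: drop rows older than a cutoff.
# In Rails this would likely be a scheduled worker doing something like
#   PendingUpload.where("created_at < ?", cutoff).delete_all
# (model name assumed).
def cleanup_stale!(pending_rows, older_than:, now: Time.now)
  cutoff = now - older_than
  pending_rows.reject! { |row| row[:created_at] < cutoff }
  pending_rows
end

now = Time.utc(2022, 12, 1, 12, 0, 0)
rows = [
  { id: 1, created_at: now - 7200 }, # 2 hours old: stale, dropped
  { id: 2, created_at: now - 60 }    # 1 minute old: kept
]
cleanup_stale!(rows, older_than: 3600, now: now)
# rows now only contains the fresh entry (id: 2)
```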
🏗 This MR
This MR is basically a PoC for the idea above:
- Create a `ci_job_artifact_pending_uploads` table with columns for the project id, job id, job artifact id, remote id, `created_at` and `updated_at`.
- Update the authorize and finalize actions on the CI job artifact uploads so that the pending row is created during authorize and read back for the finalize action.
- Update the object storage logic so that clients (in this case the CI job artifact upload endpoints) can send the final location that the file needs to be at on Object Storage.
- Gate the changes behind a feature flag.
🍿 Demo time
The setup is a very simple one:
- Object storage setup for Minio.
- Consolidated configuration not used.
- A project with the following `.gitlab-ci.yml`:

```yaml
default:
  image: alpine:latest

txt:
  script: echo "This is a sample text" > bananas.txt
  artifacts:
    paths:
      - bananas.txt
    expire_in: 1 week
```
⛳ Without the feature flag enabled, aka current situation
We run a pipeline and here is what minio receives as requests:
We can observe:
- an `s3.CopyObject` request is triggered. 👈 That's the one we want to avoid.
- multiple `s3.DeleteObject` requests are triggered.
- In total, we have `13` object storage interactions (and that's for a single tiny text file).
⛳ With the feature flag enabled, aka this MR.
Again, we run a pipeline and here is what minio receives:
We can see:
- no `s3.CopyObject` request triggered. 🎉
- no `s3.DeleteObject` requests triggered. 🎉
- In total, we have `8` object storage interactions (that's a ~38% reduction).
🚧 What this MR needs to be production ready
- A database review; we need more 👀 around the insert with the primary key sequence read.
- Test this MR under different object storage configurations:
  - object storage disabled (filesystem).
  - consolidated configuration.
- Improve the changes here:
  - The changes in `app/services/ci/job_artifacts/create_service.rb` need a refactoring.
  - We need a transaction in the finalize request (7.). Currently this transaction contains too many function calls; we should reduce it to just the CI job artifact row insert and the pending CI job artifact removal.
  - A `find_by` is used. We need a model scope.
🔮 Possible follow ups
- Investigate all those `GET` requests that we see on minio and try to remove them.
- Refactor/improve the solution so that it can be used for other uploads.
  - We have #348959 in this direction.
- Up for another crazy idea? Given that the "pending" table becomes "temporary" storage (between 3. and 7.), we could use a different storage for it. One good candidate could be Redis. Why Redis? Because we can have expiring keys = we don't need a background job to clean up stale pending data. Redis will automatically remove them for us 😸
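With the redis-rb client that would be roughly `redis.set("pending_artifact:#{id}", payload, ex: 3600)` (key name and TTL are illustrative). The runnable sketch below mimics that expiring-key behaviour in plain Ruby, to show how stale pending data would clean itself up:

```ruby
# Plain-Ruby mimic of Redis expiring keys (SET key value EX ttl).
# Real Redis expires keys for us; here we expire lazily on read.
class ExpiringStore
  Entry = Struct.new(:value, :expires_at)

  def initialize(clock: -> { Time.now })
    @clock = clock
    @data = {}
  end

  def set(key, value, ex:)
    @data[key] = Entry.new(value, @clock.call + ex)
  end

  def get(key)
    entry = @data[key]
    return nil unless entry
    if @clock.call >= entry.expires_at
      @data.delete(key) # TTL elapsed: the entry is gone, like in Redis
      nil
    else
      entry.value
    end
  end
end

now = Time.utc(2022, 12, 1)
store = ExpiringStore.new(clock: -> { now })
store.set("pending_artifact:555", "payload", ex: 3600)
before = store.get("pending_artifact:555") # readable before the TTL elapses
now += 3601
after = store.get("pending_artifact:555")  # nil afterwards, no cleanup job needed
```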