Migrate job artifacts data from `ci_builds` to `ci_job_artifacts` table (with Background migrations)
Summary of this MR
This merge request migrates job artifacts data from
ci_job_artifacts table. After we've accomplished this, we can simplify SQLs for looking up job artifacts, and effectively fixes broken
Until GitLab 10.3, job artifacts data had been stored in
ci_builds table. Everytime GitLab-rails received artifacts blob from GitLab-runners,
ci_builds.artifacts_metadata columns were updated and actual files were stored into a directory.
But it turned out this architecture was not scalable, because we had to add a new column to
ci_builds per introducing a new artifact type. Therefore, in GitLab 10.3, we released a new home for job artifacts -
ci_job_artifacts. This table persists one artifact data per a row, and since it has
file_type column (Enum), it's easy to add a new artifact type. We've been using this architecture since 10.3, and so far there are no problems.
To summarize our current situation, please take a look at the following table.
|Job artifacts (~ %10.3)||
|Job artifacts (%10.3 ~ Present)||
Also, please note that we often refer the job artifacts before 10.3 as Legacy job artifacts.
Why this MR is necessary
Since 10.3, the new home
ci_job_artifacts has been working finely on our production server. But the legacy artifacts data still resides in
ci_builds table. Also, we found some SQL query is slow because those have to
SELECT from both
ci_builds table and
ci_job_artifacts tables. In fact,
ExpireBuildArtifactsWorker(Cron worker) has been broken on gitlab.com for a few months that it fails every time by statement timeouts. In order to fix those problems, we need to migrate legacy artifacts to new home, and accomplish the single source of truth.
How this MR migrates
We migrate data from
ci_job_artifacts table, however, the thing is not simple, because we also have to consider migrations for actual files (blobs).
To understand this problem, here are the columns to persist legacy artifacts information.
- ci_builds.artifacts_file ... File name of the artifact file
- ci_builds.artifacts_file_store ... File store of the artifact file (store can be either local file storage or remote object storage)
- ci_builds.artifacts_size ... File size of the artifact file
- ci_builds.artifacts_metadata ... File name of the artifact metadata file
- ci_builds.artifacts_file_store ... File store of the artifact metadata file (store can be either local file storage or remote object storage)
and the actual files were stored in a directory. The convention of a directory path is
Migrating data from the above columns is easy, however, migrating actual files is difficult, because those files have already been uploaded to object storage (S3) on gitlab.com. In addition. there are no API provided from OS, to move files as batch operation. If we move actual files one by one, this will take very long time. Also, ingress cost can't be ignorable as network traffic going back and forth between GCP and S3. We should avoid this problem, at least for now.
Actually, there are no problems even if we didn't migrate actual files that it's not stored in database anyway. Wherever it is, we can look up as long as we persist the information of where it's stored.
So, this MR migrates data in database only. Here is the expected result after this migration finished.
|Job artifacts (~ Present)||
Statics on gitlab.com
- The number of legacy artifacts ... 3,630,545
- The number of ci_builds ... 89,972,286
- The number of ci_job_artifacts ... 77,238,225
- The biggest gap in a single batch range ...
Batch: 9538 range: 9603987.. 9705164(delta: 101,177)
NOTE: https://gitlab.com/gitlab-com/infrastructure/issues/4212 is the investigation issue
This MR uses background migrations, thus it needs to clean up at the next major/minor release as it's described in handbook. Also, during the cleanup-phase, the partial index, which was added in this MR, will be removed, as it's no longer necessary.
This MR adds a new column
file_location to indicates that where actual files are stored. If possible, we want to add the column with the default value and
NOT NULL constraint, however, we don't have a way to accomplish this, given
ci_job_artifacts is a very big table, which holds over 50M rows.
Legacy artifact rows persisted file size, however, legacy artifact metadata rows didn't persist file size. This means all migrated rows of artifact metadata are empty. We need to follow-up in the issue.
Does this MR meet the acceptance criteria?
- Changelog entry added, if necessary
- Tests added for this feature/bug
- Has been reviewed by Backend
- Has been reviewed by Database
End-to-end tests pass (
package-and-qamanual pipeline job)
What are the relevant issue numbers?
Close #46652 (closed)