Migrate job artifacts data from `ci_builds` to `ci_job_artifacts` table (with Background migrations) (!18615)

Summary of this MR

This merge request migrates job artifacts data from the ci_builds table to the ci_job_artifacts table. Once this is done, we can simplify the SQL queries that look up job artifacts, which in turn fixes the broken ExpireBuildArtifactsWorker.

History

Until GitLab 10.3, job artifacts data was stored in the ci_builds table. Every time GitLab Rails received an artifacts blob from a GitLab Runner, the ci_builds.artifacts_file and ci_builds.artifacts_metadata columns were updated and the actual files were stored in a directory.

However, this architecture turned out not to scale, because we had to add a new column to ci_builds each time we introduced a new artifact type. Therefore, in GitLab 10.3, we released a new home for job artifacts: ci_job_artifacts. This table persists one artifact per row, and since it has a file_type column (Enum), it's easy to add a new artifact type. We've been using this architecture since 10.3, and so far there have been no problems.

To summarize our current situation, please take a look at the following table.

| Timeline | Table | Path (Example) |
| --- | --- | --- |
| Job artifacts (~ %10.3) | `ci_builds.artifacts_file`, `ci_builds.artifacts_metadata` | "shared/artifacts/2018_03/12/10/archive.zip", "shared/artifacts/2018_03/12/10/metadata.gzip" |
| Job artifacts (%10.3 ~ Present) | `ci_job_artifacts.file`, `ci_job_artifacts.file_type` (`archive: 1, metadata: 2`) | "shared/artifacts/...hash.../100/10/1/archive.zip", "shared/artifacts/...hash.../100/10/2/metadata.gzip" |

Also, please note that we often refer to job artifacts from before 10.3 as legacy job artifacts.

Why this MR is necessary

Since 10.3, the new home ci_job_artifacts has been working well on our production server, but the legacy artifacts data still resides in the ci_builds table. We also found that some SQL queries are slow because they have to SELECT from both the ci_builds and ci_job_artifacts tables. In fact, ExpireBuildArtifactsWorker (a cron worker) has been broken on gitlab.com for a few months: it fails on every run with a statement timeout. To fix these problems, we need to migrate legacy artifacts to the new home and establish a single source of truth.

How this MR migrates

We migrate data from the ci_builds table to the ci_job_artifacts table. This is not as simple as it sounds, because we also have to consider migrating the actual files (blobs).

To understand this problem, here are the columns that persist legacy artifact information:

ci_builds.artifacts_file ... File name of the artifact file
ci_builds.artifacts_file_store ... File store of the artifact file (store can be either local file storage or remote object storage)
ci_builds.artifacts_size ... File size of the artifact file
ci_builds.artifacts_metadata ... File name of the artifact metadata file
ci_builds.artifacts_metadata_store ... File store of the artifact metadata file (store can be either local file storage or remote object storage)

The actual files were stored in a directory whose path follows the convention shared/artifacts/YYYY_MM/project_id/build_id/.
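To make the legacy convention concrete, here is a minimal sketch in plain Ruby. The helper name `legacy_artifacts_dir` and its parameters are hypothetical, and it assumes that the YYYY_MM segment comes from a timestamp associated with the job (this MR description does not say which one). It is not GitLab's actual uploader code.

```ruby
require "date"

# Hypothetical helper: builds the legacy directory for a job's artifacts,
# following the shared/artifacts/YYYY_MM/project_id/build_id/ convention.
# Assumption: `timestamp` is whatever date the YYYY_MM segment is derived from.
def legacy_artifacts_dir(timestamp:, project_id:, build_id:, root: "shared/artifacts")
  File.join(root, timestamp.strftime("%Y_%m"), project_id.to_s, build_id.to_s)
end

dir = legacy_artifacts_dir(timestamp: Date.new(2018, 3, 1), project_id: 12, build_id: 10)
puts File.join(dir, "archive.zip")
puts File.join(dir, "metadata.gzip")
```

With these inputs the two paths match the legacy examples used elsewhere in this description.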

Migrating data from the above columns is easy. Migrating the actual files is difficult, because those files have already been uploaded to object storage (S3) on gitlab.com. In addition, object storage provides no API for moving files as a batch operation, so moving the files one by one would take a very long time. The ingress cost is also not negligible, as network traffic would go back and forth between GCP and S3. We should avoid this problem, at least for now.

In fact, leaving the actual files where they are causes no problems, because the files are not stored in the database anyway. Wherever a file lives, we can look it up as long as we persist the information about where it is stored.

So, this MR migrates data in the database only. Here is the expected result after this migration has finished.

| Timeline | Table | Path (Example) |
| --- | --- | --- |
| Job artifacts (~ Present) | `ci_job_artifacts.file`, `ci_job_artifacts.file_type` (`archive: 1, metadata: 2`) | If `file_location` is `:legacy_path`, then it uses "shared/artifacts/2018_03/12/10/archive.zip", "shared/artifacts/2018_03/12/10/metadata.gzip". If `file_location` is `:hashed_path`, then it uses "shared/artifacts/...hash.../100/10/1/archive.zip", "shared/artifacts/...hash.../100/10/2/metadata.gzip". |
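The lookup rule in the table above can be sketched as a small dispatch on `file_location`. This is an illustration only: `artifact_dir` is a hypothetical helper, both directories are supplied by the caller, and treating a missing `file_location` as the hashed layout is an assumption (rows written to ci_job_artifacts before the column existed could only have used the hashed path).

```ruby
# Hypothetical resolver, not GitLab's actual implementation: returns the
# directory an artifact row's file lives in, based on its file_location.
def artifact_dir(file_location:, legacy_dir:, hashed_dir:)
  case file_location
  when :legacy_path
    legacy_dir
  when :hashed_path, nil
    # Assumption: a nil file_location means the row predates the column,
    # so its file is stored under the hashed layout.
    hashed_dir
  else
    raise ArgumentError, "unknown file_location: #{file_location.inspect}"
  end
end

legacy = "shared/artifacts/2018_03/12/10"
hashed = "shared/artifacts/...hash.../100/10/1" # hash segment elided, as in the table

puts File.join(artifact_dir(file_location: :legacy_path, legacy_dir: legacy, hashed_dir: hashed), "archive.zip")
puts File.join(artifact_dir(file_location: :hashed_path, legacy_dir: legacy, hashed_dir: hashed), "archive.zip")
```

The point of the design is that the blob never moves: only the row records which layout to use when resolving the path.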

Statistics on gitlab.com

The number of legacy artifacts ... 3,630,545
The number of ci_builds ... 89,972,286
The number of ci_job_artifacts ... 77,238,225
The biggest gap in a single batch range ... Batch: 9538 range: 9603987..9705164 (delta: 101,177)

NOTE: https://gitlab.com/gitlab-com/infrastructure/issues/4212 is the investigation issue
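As a rough illustration of what a "gap in a single batch range" means, the sketch below splits sparse ids into fixed-size batches and reports the id range each batch spans. The helper `batch_ranges`, the sample ids, and the batch size of 3 are all hypothetical; how this MR actually forms its batches is not described here.

```ruby
# Hypothetical helper: groups sorted ids into batches of `batch_size` and
# records the id range each batch covers plus its width (delta).
def batch_ranges(ids, batch_size)
  ids.sort.each_slice(batch_size).map do |slice|
    { range: slice.first..slice.last, delta: slice.last - slice.first }
  end
end

ids = [1, 2, 3, 10, 500, 501] # made-up, sparse ids
widest = batch_ranges(ids, 3).max_by { |batch| batch[:delta] }
puts "range: #{widest[:range]} (delta: #{widest[:delta]})"

# The delta reported above is the difference between the range endpoints.
puts 9_705_164 - 9_603_987
```

A batch holds the same number of rows regardless of its delta, but a wide range means any query bounded by those two ids spans many more ci_builds rows.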

Future iterations

Clean up MigrateLegacyArtifacts background migrations with BackgroundMigration.steal

This MR uses background migrations, so they need to be cleaned up in the next major/minor release, as described in the handbook. During that cleanup phase, the partial index added in this MR will also be removed, as it will no longer be necessary.

Fill nullified file_location on ci_job_artifacts table

This MR adds a new column, file_location, which indicates where the actual files are stored. Ideally we would add the column with a default value and a NOT NULL constraint, but we don't have a way to do that, given that ci_job_artifacts is a very big table holding over 50M rows.

Fill size values of ci_job_artifacts table from object storage

Legacy artifact rows persisted the file size, but legacy artifact metadata rows did not. This means the size of every migrated artifact metadata row is empty. We need to follow up on this in the issue.
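The size gap is easiest to see in a sketch of the row mapping. Everything here is illustrative plain Ruby rather than the actual MigrateLegacyArtifacts code: `LegacyBuild` and `legacy_artifact_rows` are hypothetical, the `job_id` and `file_store` keys on the target rows are assumed column names, `artifacts_metadata_store` is assumed to be the metadata file's store column, and the `:remote` store value is a placeholder.

```ruby
# Hypothetical stand-in for a ci_builds row, limited to the legacy artifact columns.
LegacyBuild = Struct.new(
  :id, :artifacts_file, :artifacts_file_store, :artifacts_size,
  :artifacts_metadata, :artifacts_metadata_store,
  keyword_init: true
)

FILE_TYPES = { archive: 1, metadata: 2 }.freeze

# Maps one legacy build to the ci_job_artifacts rows it would produce.
# No file is touched; every row is marked :legacy_path.
def legacy_artifact_rows(build)
  rows = []
  if build.artifacts_file
    rows << { job_id: build.id, file_type: FILE_TYPES[:archive],
              file: build.artifacts_file, file_store: build.artifacts_file_store,
              size: build.artifacts_size, file_location: :legacy_path }
  end
  if build.artifacts_metadata
    # ci_builds has no size column for the metadata file, so size stays nil.
    rows << { job_id: build.id, file_type: FILE_TYPES[:metadata],
              file: build.artifacts_metadata, file_store: build.artifacts_metadata_store,
              size: nil, file_location: :legacy_path }
  end
  rows
end

build = LegacyBuild.new(id: 10, artifacts_file: "archive.zip", artifacts_file_store: :remote,
                        artifacts_size: 2048, artifacts_metadata: "metadata.gzip",
                        artifacts_metadata_store: :remote)
legacy_artifact_rows(build).each { |row| puts row.inspect }
```

Filling the missing sizes would then mean reading them back from wherever each metadata file is stored, which is what the follow-up covers.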

Does this MR meet the acceptance criteria?

Changelog entry added, if necessary
Tests added for this feature/bug
Review
- Has been reviewed by Backend
- Has been reviewed by Database
End-to-end tests pass (package-and-qa manual pipeline job)

What are the relevant issue numbers?

Close https://gitlab.com/gitlab-org/gitlab-ce/issues/46652

Edited Sep 04, 2018 by Kamil Trzciński
