
Migrate traces to Object Storage (S3)

TODOs

  • All traces have been migrated to S3
  • Investigate the traces that failed to migrate (i.e. those that still remain in FS after the migration)

Current status

We have about 50,000,000 jobs today. Each job has a trace, and each trace falls into one of the following eras.

| Era | Path | Counts | Artifact record? | Stored in | Comments |
|---|---|---|---|---|---|
| 1st | - | 87636 | false | Database (PostgreSQL) | Issue => https://gitlab.com/gitlab-org/gitlab-ce/issues/34317 |
| 2nd | #{builds_path}/#{YYYY_MM}/#{project_ci_id}/#{job_id}.log (e.g. /builds/2018_01/1/12.log) | - | false | FileStorage (NFS) | This can be considered as 3rd Era. |
| 3rd | #{builds_path}/#{YYYY_MM}/#{project_id}/#{job_id}.log (e.g. /builds/2018_02/19/1172.log) | approx. 50,000,000 | false | FileStorage (NFS) | - |
| 4th (Latest) | artifacts/#{SHA256['project_id']}/#{YYYY_MM_DD}/#{job_id}/#{job_artifact_id}/trace.log (e.g. artifacts/94/00/94..67/2018_02_13/1374/249/trace.log) | increasing since %10.5 | true | FileStorage (NFS) or ObjectStorage (S3) | We can easily move traces between FS <=> OS |
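
To make the 2nd and 3rd era path layout concrete, here is a minimal Ruby sketch that splits a path (relative to builds_path) into its parts. The constant `LEGACY_TRACE_PATH` and the helper `parse_legacy_trace_path` are hypothetical names used only for illustration; they are not part of GitLab. Note that the two eras share the same shape, so the path alone cannot tell whether the middle segment is a project_ci_id or a project_id.

```ruby
# Hypothetical pattern for 2nd/3rd era paths relative to builds_path:
#   YYYY_MM/<project_ci_id or project_id>/<job_id>.log
LEGACY_TRACE_PATH = %r{\A(?<year_month>\d{4}_\d{2})/(?<project_dir>\d+)/(?<job_id>\d+)\.log\z}

# Returns a hash with the parsed parts, or nil if the path does not
# look like a legacy trace path.
def parse_legacy_trace_path(relative_path)
  match = LEGACY_TRACE_PATH.match(relative_path)
  return nil unless match

  {
    year_month: match[:year_month],
    # May be a project_ci_id (2nd era) or a project_id (3rd era).
    project_dir: Integer(match[:project_dir], 10),
    job_id: Integer(match[:job_id], 10)
  }
end
```

For example, `parse_legacy_trace_path('2018_02/19/1172.log')` yields year_month `2018_02`, project_dir `19`, and job_id `1172`, while a 4th era path such as one ending in `trace.log` yields `nil`.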

Goals

Migrate all trace files from the 2nd and 3rd eras to the 4th era.

The current approach

Proposed by @ayufan https://gitlab.com/gitlab-org/gitlab-ee/issues/4170#note_55903624

We're preparing a script to manually migrate trace files.

The logic of the script

  1. Choose a trace file
  2. Detect job_id from the file name. This job will be associated with an artifact record.
  3. Copy the trace file into a tmp folder
  4. Create an artifact record with the copied trace file. If BackgroundUploader is ON, this trace moves to ObjectStorage; otherwise it remains in FileStorage.
  5. Verify that the checksum of the uploaded file matches the checksum of the original trace file.
  6. Remove the cloned trace file
  7. Move the original trace file to a backup folder.
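
The steps above can be sketched with plain file operations as follows. This is only an illustration under stated assumptions, not the actual script: `migrate_trace_file` is a hypothetical name, and `store` is a hypothetical callable that stands in for creating the artifact record (step 4) and returns a local path to the stored file so that the checksum can be compared. SHA256 is an assumption as well; the issue does not say which checksum is used.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Sketch of steps 2-7 for a single trace file.
# `store` is a hypothetical callable: store.call(job_id, copy_path) must
# persist the copy and return a readable path to the stored file.
def migrate_trace_file(original_path, backup_dir:, store:)
  # Step 2: the file name (without ".log") is the job id.
  job_id = Integer(File.basename(original_path, '.log'), 10)

  Dir.mktmpdir('trace-migration') do |tmp_dir|
    # Step 3: work on a copy so the original stays untouched.
    copy_path = File.join(tmp_dir, File.basename(original_path))
    FileUtils.cp(original_path, copy_path)

    # Step 4: stand-in for creating the artifact record.
    stored_path = store.call(job_id, copy_path)

    # Step 5: compare checksums of the original and the stored file.
    original_sum = Digest::SHA256.file(original_path).hexdigest
    stored_sum = Digest::SHA256.file(stored_path).hexdigest
    raise "checksum mismatch for job #{job_id}" unless original_sum == stored_sum
  end
  # Step 6: the tmp folder (and the copy in it) is removed when the block exits.

  # Step 7: move the original into the backup folder only after verification.
  FileUtils.mkdir_p(backup_dir)
  FileUtils.mv(original_path, File.join(backup_dir, File.basename(original_path)))
  job_id
end
```

Because the original is moved only after the checksum matches, a failed upload leaves the trace in place, which is what makes the failed traces discoverable afterwards.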

The usage of the script

  • rake gitlab:traces:migrate['2018_02/19/1172.log'] # Migrate a trace file
  • rake gitlab:traces:migrate['2018_02'] # Migrate trace files in builds/2018_02 folder
  • rake gitlab:traces:migrate['.'] # Migrate all trace files
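
One way the three argument forms above could be resolved into a list of files is sketched below. `trace_files_for` is a hypothetical helper, and treating the argument as a path relative to builds_path is an assumption drawn from the examples, not a description of the real task.

```ruby
# Hypothetical helper: expands the rake argument into trace files.
# A file argument yields that one file; a directory argument (including
# '.') yields every *.log file beneath it, in sorted order.
def trace_files_for(builds_path, target)
  full_path = File.expand_path(target, builds_path)
  return [full_path] if File.file?(full_path)

  Dir.glob(File.join(full_path, '**', '*.log')).sort
end
```

With this shape, `'2018_02/19/1172.log'` selects a single trace, `'2018_02'` selects one month, and `'.'` selects everything under builds_path.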

Concerns

  • What if traces go missing for some reason? How can we recover them?
  • How can we speed up the migration? How can we parallelize the processes?
  • When will the migration be done if we stick with the current approach? How long does each approach take?
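
To reason about the last two concerns, a back-of-the-envelope sketch may help. Both helpers are hypothetical, and the per-trace time used in the example is a placeholder, not a measurement; only the trace count of about 50,000,000 comes from this issue. Splitting work by YYYY_MM folder is one possible parallelization, chosen here because the rake task already accepts a month folder.

```ruby
SECONDS_PER_DAY = 86_400

# Rough wall-clock estimate in days, assuming every trace takes the same
# time and the workers scale perfectly (both are simplifications).
def estimated_migration_days(trace_count:, seconds_per_trace:, workers: 1)
  trace_count * seconds_per_trace.to_f / workers / SECONDS_PER_DAY
end

# Distributes items (e.g. YYYY_MM folder names) across workers in
# round-robin order, so each worker can run the rake task on its share.
def partition_round_robin(items, workers)
  buckets = Array.new(workers) { [] }
  items.each_with_index { |item, index| buckets[index % workers] << item }
  buckets
end
```

As an illustration only: with a placeholder of 0.1 seconds per trace, 50,000,000 traces would take roughly 58 days on one worker and under 6 days on ten. The real per-trace time has to be measured before any schedule can be given.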

/cc @andrewn @ayufan

Edited by Shinya Maeda