New CI Job live-trace architecture
Description
We are currently working on making GitLab Cloud Native-compatible. The CI/CD team has two objectives for this: one is "Artifacts Direct to CI", and the other is "Build Logs on Object Storage". This issue specifically discusses the latter, "Build Logs on Object Storage".
Current architecture - Why do we need to change?
- The Runner requests a job from Unicorn and picks one up.
- While the job is running, the Runner sends chunks of the trace to Unicorn. Unicorn appends those chunks to FileStorage.
- When the job is done, the Runner sends the job result and the full trace to Unicorn. Unicorn writes the full trace to FileStorage.
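To make the append semantics concrete, here is a minimal sketch of what the per-chunk write amounts to today (the method and path are illustrative, not the exact GitLab implementation):

```ruby
# Appending a trace chunk to a local file is a cheap, native operation.
def append_chunk(trace_path, chunk)
  File.open(trace_path, 'ab') { |f| f.write(chunk) }
end
```

It is exactly this cheap append that ObjectStorage does not offer, which is what the ideas below work around.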
As you can see, we currently save traces in FileStorage, which goes against the direction of "we no longer rely on FileStorage, even if it's temporary". So, somehow, we want to bypass FileStorage.
However, this is not trivial, because job traces are dynamic objects that are updated periodically while the job is running, whereas job artifacts are uploaded only once, when the job is done.
After investigating, I came up with some ideas. Each idea has pros and cons, and I'd like to discuss which is the best solution. Any suggestions, feedback, and complaints are welcome.
Idea 1: Just replace FileStorage with ObjectStorage
Let's consider the simplest solution: just replace FileStorage with ObjectStorage.
- The Runner requests a job from Unicorn and picks one up.
- While the job is running, the Runner sends chunks of the trace to Unicorn. Unicorn appends those chunks to ObjectStorage.
- When the job is done, the Runner sends the job result and the full trace to Unicorn. Unicorn writes the full trace to ObjectStorage.
This solution has a problem at the second step, because ObjectStorage does not support an append operation. If we really want to do this, we need three steps: 1) fetch the latest trace from ObjectStorage, 2) append the new chunk, 3) push it back to ObjectStorage. This causes significant network throughput and is much slower than the FileStorage approach.
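For illustration, this naive read-modify-write cycle would look roughly like the following sketch (using the aws-sdk-s3 gem; the bucket, key, and helper names are placeholders). Every chunk forces a full download and a full re-upload of the trace accumulated so far:

```ruby
require 'aws-sdk-s3'

# Naive "append" against object storage: re-transfers the whole trace
# on every chunk the Runner sends.
def append_chunk(s3, bucket, key, chunk)
  current =
    begin
      s3.get_object(bucket: bucket, key: key).body.read # 1) fetch the latest trace
    rescue Aws::S3::Errors::NoSuchKey
      ''
    end
  s3.put_object(bucket: bucket, key: key, body: current + chunk) # 2) append, 3) push back
end

append_chunk(Aws::S3::Client.new, 'ci-traces', 'path/to/trace.log', "new output\n")
```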
There is a workaround for this appending problem: use the multipart upload API provided by each cloud service (this example is from S3).
Upload objects in parts—Using the multipart upload API, you can upload large objects, up to 5 TB. The multipart upload API is designed to improve the upload experience for larger objects. You can upload objects in parts. These object parts can be uploaded independently, in any order, and in parallel. You can use a multipart upload for objects from 5 MB to 5 TB in size. For more information, see Uploading Objects Using Multipart Upload API.
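As a rough sketch of how trace chunks could map onto that API (again with the aws-sdk-s3 gem; the bucket, key, and the `chunks` enumerable are placeholders, and the minimum part size is not handled here):

```ruby
require 'aws-sdk-s3'

s3     = Aws::S3::Client.new
bucket = 'ci-traces'
key    = 'path/to/trace.log'

# 1) Start a multipart upload when the job starts.
upload_id = s3.create_multipart_upload(bucket: bucket, key: key).upload_id

# 2) Upload each incoming trace chunk as one part. Every part except the
#    last must be at least 5 MB, which is one of the concerns below.
parts = []
chunks.each_with_index do |chunk, index|
  part = s3.upload_part(bucket: bucket, key: key, upload_id: upload_id,
                        part_number: index + 1, body: chunk)
  parts << { etag: part.etag, part_number: index + 1 }
end

# 3) Complete the upload when the job is done; the object only becomes
#    readable as a whole at this point.
s3.complete_multipart_upload(bucket: bucket, key: key, upload_id: upload_id,
                             multipart_upload: { parts: parts })
```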
Since this API is designed for a different purpose (especially for big data), there are some concerns:
- Object sizes need to exceed 5 MB
- Can users read the trace while it's still being uploaded with the multipart upload API?
- Is there a similar API in GCP? (TODO: we should research this first)
Idea 2: Store chunks separately
We had a problem with the append operation, so how about saving the chunks separately?
- The Runner requests a job from Unicorn and picks one up.
- While the job is running, the Runner sends chunks of the trace to Unicorn. Unicorn writes those chunks to ObjectStorage separately (e.g. `path/to/trace/chunk_1.log`, `path/to/trace/chunk_2.log`, ...).
- When the job is done, the Runner sends the job result and the full trace to Unicorn. Unicorn writes the full trace to ObjectStorage and removes the `chunk_*.log` files.
The technical difficulty resides in how we can read those separate chunks as ONE object. We should probably have a DB table that stores `byte_start` and `byte_end` for each chunk; when the trace is read, we calculate the relevant offsets and fetch the necessary chunks from ObjectStorage. Of course, once the job is done, we don't need this logic anymore. It is needed only for running jobs, as sketched below.
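A rough sketch of that chunk index and the read path (the table layout, struct, and method names are hypothetical, not an agreed schema):

```ruby
require 'aws-sdk-s3'

# Hypothetical chunk index row, e.g. backed by a `ci_job_trace_chunks`
# table with chunk_path, byte_start and byte_end columns.
Chunk = Struct.new(:path, :byte_start, :byte_end)

# Read `length` bytes starting at `offset` by stitching together the
# chunks whose byte ranges overlap the requested range.
def read_trace(s3, bucket, chunks, offset, length)
  chunks.select { |c| c.byte_end > offset && c.byte_start < offset + length }
        .sort_by(&:byte_start)
        .map do |c|
          body = s3.get_object(bucket: bucket, key: c.path).body.read
          from = [offset - c.byte_start, 0].max
          to   = [offset + length - c.byte_start, body.bytesize].min
          body[from...to]
        end.join
end
```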
Idea 3: Store chunks in Redis
We use Redis while streaming the trace; this is simpler and faster than Ideas 1 and 2.
- The Runner requests a job from Unicorn and picks one up.
- While the job is running, the Runner sends chunks of the trace to Unicorn. Unicorn appends the chunks to Redis.
- When the job is done, the Runner sends the job result and the full trace to Unicorn. Unicorn uploads the full trace to ObjectStorage directly and flushes (clears) the chunks stored in Redis.
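A minimal sketch of that flow with the redis and aws-sdk-s3 gems (`job_id` and `chunk` stand in for values from the running job; the key layout, bucket, and paths are assumptions, not an existing API):

```ruby
require 'redis'
require 'aws-sdk-s3'

redis = Redis.new
s3    = Aws::S3::Client.new
key   = "gitlab:ci:trace:#{job_id}" # hypothetical key layout

# While the job is running: Redis APPEND gives us the append semantics
# that ObjectStorage lacks, and reading the live trace is a single GET.
redis.append(key, chunk)
live_trace = redis.get(key)

# When the job is done: persist the full trace and drop the Redis copy.
s3.put_object(bucket: 'ci-traces', key: "#{job_id}/trace.log", body: redis.get(key))
redis.del(key)
```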
As a bonus, we can effectively use Redis commands to build up our custom logic.
However, there are some concerns:
- Huge memory consumption on Redis (we assume this would be 10-20 GB). In addition, since we have auto-scaled runners, we would have to auto-scale Redis clusters as well to prevent a Redis outage, e.g. "3 GB pods (Redis) x N > the total size of the trace chunks of running jobs".
Idea 4: Socket communication with Runner
This is the idea that Runners themselves provide the ability to stream traces. It is too ambitious and complicated to achieve, but you can take a look at the details at https://gitlab.com/gitlab-org/gitlab-ee/issues/4180#note_54281489.
Sub-objectives
- Make traces artifacts. Traces are produced by jobs (runners), so they can be considered a kind of artifact. In other words, `job.job_artifacts_trace` should represent the job's own trace instance.
- Secure trace paths. Today we store traces at `Gitlab::Ci::Trace#default_path` (e.g. "builds/2017_06/2/222.log"), from which the project_id and job_id used in the GitLab instance can easily be guessed. We want to secure those paths the same way artifacts do (e.g. `JobArtifactUploader#default_path`: "4a/44/4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5/2017_12_14/938/1/artifacts.zip").
- Migrate 2 TB of traces on GitLab.com to ObjectStorage.
Links / references
- gitlab-com/migration#23 (closed)
- https://gitlab.com/gitlab-org/gitlab-ee/issues/4183
- https://gitlab.com/gitlab-org/gitlab-ee/issues/2348
- https://gitlab.com/gitlab-org/gitlab-ee/issues/4184
- https://gitlab.com/gitlab-org/gitlab-ee/issues/4172
- https://gitlab.com/gitlab-org/gitlab-ee/issues/4180
- https://gitlab.com/gitlab-org/gitlab-ee/issues/4171