General availability on New Live Trace architecture

We shipped the new live trace architecture (beta) in %10.8. This feature is off by default. The architecture details are documented at https://gitlab.com/gitlab-org/gitlab-ce/blob/master/doc/administration/job_traces.md#new-live-trace-architecture

Problems

The feature is functional; however, several problems hold us back from enabling it on the production server.

This feature could incur data loss:
- Case 1: When all data in Redis is accidentally flushed.
  - Ongoing live traces can be recovered by re-sending the traces (this is supported by all versions of GitLab Runner).
  - Finished jobs whose live traces have not been archived will lose the last part (~128kB) of the trace data.
- Case 2: When Sidekiq workers fail to archive (e.g. a bug that prevents the archiving process, Sidekiq inconsistency, etc.).
  - Currently, all trace data in Redis is deleted after one week. If the Sidekiq workers have not finished by the expiry date, the last part of the trace data (~128kB) will be lost.
This feature could eat up a lot of memory on the Redis instance and kill it by OOM, e.g. 10000 running jobs => 128kB * 10000 = 1.28 GB consumed.
- We can prevent it by setting an API limit, but this might not be a good solution.
- We should have a way to mitigate the problem without downtime.

We can stop the growth of memory consumption by disabling this feature. It takes effect immediately after we flip the feature flag (Feature.disable('ci_enable_live_trace')). After it is disabled, total memory consumption declines slowly as the allocated data is released (archived) little by little.

This feature could increase database replication lag. An INSERT is generated to indicate that we have a trace chunk. An UPDATE with 128kB of data is issued once we receive multiple chunks.
etc
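
The memory and database figures above can be reproduced with a rough sizing sketch. It assumes that each running job holds at most one unflushed chunk in Redis and that each chunk of a trace corresponds to one database row (one INSERT). The method names are hypothetical and not taken from GitLab's code, and the chunk size is taken as 128 kB in decimal units to match the 1.28 GB estimate.

```ruby
# Back-of-the-envelope sizing for live traces (assumptions, not measurements).

# Chunk size in bytes: 128 kB in decimal units, matching the 1.28 GB figure.
def chunk_bytes
  128 * 1000
end

# Worst case: every running job holds one full chunk in Redis.
def worst_case_redis_bytes(running_jobs)
  running_jobs * chunk_bytes
end

# Number of chunks (assumed: one database row each) needed for one trace.
def chunks_per_trace(trace_bytes)
  (trace_bytes.to_f / chunk_bytes).ceil
end

def gigabytes(bytes)
  (bytes / 1_000_000_000.0).round(2)
end

trace_bytes = 1_000_000 # a 1 MB trace, as used in the stress tests
[100, 1_000, 10_000].each do |jobs|
  puts format("%6d jobs: up to %.2f GB in Redis, %d chunk rows",
              jobs,
              gigabytes(worst_case_redis_bytes(jobs)),
              jobs * chunks_per_trace(trace_bytes))
end
```

If the chunk size is in fact 128 KiB (131072 bytes), the 10000-job worst case comes to roughly 1.31 GB instead.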

Roadmap

Fundamental tests on dev.gitlab.org
- Unit/integration tests are well defined; however, we have to black-box test in an environment whose specs are similar to the production server's
- Live traces should be visible on job pages
- Archived traces should be visible on job pages
- Live traces should be archived to Object storage
- Live traces should be cleaned up after archived
- [-] ~~Verify the data correctness. We can compare 1) the incoming full trace from runners to 2) the persisted live trace.~~
- Test feature flag on/off and confirm we can pull the plug at any time
Stress tests on dev.gitlab.org
- Schedule 100 jobs with 1MB trace and process concurrently. Measure memory consumption on Redis, DB load, etc.
- Schedule 1000 jobs with 1MB trace and process concurrently. Measure memory consumption on Redis, DB load, etc.
- [-] ~~Schedule 10000 jobs with 1MB trace and process concurrently. Measure memory consumption on Redis, DB load, etc.~~ => dev.gitlab.org only scales up to 1100 jobs
[-] ~~Failover simulation on dev.gitlab.org~~
- [-] ~~Redis outage -> Runners recover the partial traces~~
Fixes
- [-] Force archive when jobs are finished. It shouldn't use a Sidekiq worker, as the archiving job could go missing due to Sidekiq inconsistency.
- [-] [Use ObjectStorage for new CI Job live-trace architecture](https://gitlab.com/gitlab-org/gitlab-ce/issues/45712)
- Implement a CronWorker (archive leftovers periodically)
  - https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/18969
  - https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/19101
Enable live trace on staging.gitlab.com
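
The idea behind the CronWorker item under Fixes (archive leftovers periodically) can be sketched as below. This is only an illustration under stated assumptions, not the implementation in the linked merge requests: jobs are modelled as plain hashes, the method names and the one-hour grace period are hypothetical, and the actual archiving step is supplied by the caller as a block.

```ruby
# Illustrative sketch only; names and the grace period are hypothetical.

# Select finished jobs whose live trace is still unarchived after a grace period.
def leftover_jobs(jobs, now:, grace_seconds: 3600)
  jobs.select do |job|
    job[:finished_at] && !job[:archived] &&
      (now - job[:finished_at]) >= grace_seconds
  end
end

# Archive every leftover job. A failure on one job must not stop the batch;
# failed jobs stay unarchived and are picked up by the next periodic run.
def archive_leftovers(jobs, now:, grace_seconds: 3600)
  failed = []
  leftover_jobs(jobs, now: now, grace_seconds: grace_seconds).each do |job|
    begin
      yield job              # caller supplies the actual archiving step
      job[:archived] = true
    rescue StandardError
      failed << job[:id]
    end
  end
  failed
end

now = Time.now
jobs = [
  { id: 1, finished_at: now - 7200, archived: false }, # leftover, gets archived
  { id: 2, finished_at: now - 60,   archived: false }, # finished too recently
  { id: 3, finished_at: nil,        archived: false }  # still running
]
failed = archive_leftovers(jobs, now: now) do |job|
  puts "archiving trace of job #{job[:id]}"
end
puts "failed: #{failed.inspect}"
```

Because the periodic run selects leftovers by state rather than relying on a previously enqueued worker, a job whose archiving was lost or failed is retried on a later run.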

Those items have been moved to the next iteration https://gitlab.com/gitlab-org/gitlab-ce/issues/47125

[-] ~~Enable live trace on gitlab.com for a short time (10-20 minutes)~~
- [-] ~~Enable ci_enable_live_trace feature flag (default: off)~~
- [-] ~~Monitor the load (Grafana, Sentry, etc.)~~
- [-] ~~Verify the data correctness if possible~~
[-] ~~Enable live trace on gitlab.com by default~~
- [-] ~~TBD~~

Edited Jun 08, 2018 by Shinya Maeda
