General availability on New Live Trace architecture
We shipped the new live trace architecture (beta) in %10.8. This feature is off by default. The architecture is described at https://gitlab.com/gitlab-org/gitlab-ce/blob/master/doc/administration/job_traces.md#new-live-trace-architecture
It is functional, but several problems hold us back from enabling it on the production server.
- This feature could incur data loss:
  - Case 1: When all data in Redis is accidentally flushed.
    - Ongoing live traces can be recovered by re-sending the traces (this is supported by all versions of GitLab Runner).
    - Finished jobs whose live traces have not been archived yet will lose the last part (~128 kB) of their trace data.
  - Case 2: When Sidekiq workers fail to archive (e.g. a bug prevents the archiving process, Sidekiq inconsistency, etc.):
    - Currently all trace data in Redis is deleted after one week. If the Sidekiq workers have not finished by the expiry date, the last part of the trace data (~128 kB) will be lost.
- This feature could consume a lot of memory on the Redis instance and get it killed by OOM, e.g. 10000 running jobs => 128 kB * 10000 = 1.28 GB consumed.
  - We can prevent this by setting an API limit, but this might not be a good solution.
  - We should have a way to mitigate the problem without downtime.
  - We can stop the growth of memory consumption by disabling this feature. It takes effect immediately after we flip the feature flag (`Feature.disable('ci_enable_live_trace')`). Once it is disabled, total memory consumption declines slowly as the allocated data is released (archived) little by little.
- This feature could increase database replication lag.
  - `INSERT`s are generated to indicate that we have a trace chunk.
  - An `UPDATE` with 128 kB of data is issued once we have received multiple chunks.
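
The Redis memory estimate in the list above can be reproduced with a short calculation. This is only a rough sketch: it assumes each running job holds at most one 128 kB chunk in Redis at a time (the figure used above, read as 128,000 bytes), and the method names are hypothetical, not part of GitLab.

```ruby
# Assumption: one running job keeps at most one ~128 kB chunk in Redis.
CHUNK_BYTES = 128_000

# Worst-case bytes held in Redis for a given number of running jobs.
def worst_case_redis_bytes(running_jobs, chunk_bytes = CHUNK_BYTES)
  running_jobs * chunk_bytes
end

# Convert bytes to decimal gigabytes (1 GB = 10^9 bytes).
def to_gb(bytes)
  bytes / 1_000_000_000.0
end

[100, 1_000, 10_000].each do |jobs|
  puts format('%6d running jobs => %.3f GB', jobs, to_gb(worst_case_redis_bytes(jobs)))
end
```

Real consumption also includes Redis per-key overhead, so the result is better treated as a lower bound on the worst case than as an exact figure.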
Fundamental tests on dev.gitlab.org
- Unit and integration tests are well defined; however, we still have to black-box test in an environment with specs similar to the production server.
- Live traces should be visible on job pages
- Archived traces should be visible on job pages
- Live traces should be archived to Object storage
- Live traces should be cleaned up after being archived
- [-] ~~Verify the data correctness. We can compare 1) the incoming full trace from runners to 2) the persisted live trace.~~
- Test feature flag on/off and confirm we can pull the plug at any time
- Stress tests on dev.gitlab.org
- Schedule 100 jobs with a 1 MB trace each and process them concurrently. Measure memory consumption on Redis, DB load, etc.
- Schedule 1000 jobs with a 1 MB trace each and process them concurrently. Measure memory consumption on Redis, DB load, etc.
- Schedule 10000 jobs with a 1 MB trace each and process them concurrently. Measure memory consumption on Redis, DB load, etc. => dev.gitlab.org only scales up to 1100 jobs
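
To get a feel for the database side of these stress tests, the number of trace chunks each scenario produces can be estimated up front. This is a sketch under stated assumptions: chunks are 128 kB (read as 128,000 bytes), a 1 MB trace is 1,000,000 bytes, and the method name is hypothetical rather than taken from GitLab.

```ruby
# Assumption: traces are split into ~128 kB chunks.
TRACE_CHUNK_BYTES = 128_000
# Assumption: "1 MB trace" means 1,000,000 bytes.
TRACE_BYTES = 1_000_000

# Number of chunks needed to hold a trace (integer ceiling division).
def chunks_per_trace(trace_bytes, chunk_bytes = TRACE_CHUNK_BYTES)
  (trace_bytes + chunk_bytes - 1) / chunk_bytes
end

[100, 1_000, 10_000].each do |jobs|
  total = jobs * chunks_per_trace(TRACE_BYTES)
  puts "#{jobs} jobs => #{total} chunks in total"
end
```

The total is only an indicator of how many chunk records the run would touch; the actual number of `INSERT` and `UPDATE` statements depends on how runners batch their trace updates.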
Failover simulation on dev.gitlab.org
Redis outage -> Runners recover the partial traces
- [-] Force archive when jobs are finished. It shouldn't rely on a Sidekiq worker, because the job could go missing through Sidekiq inconsistency.
- [-] [Use ObjectStorage for new CI Job live-trace architecture](https://gitlab.com/gitlab-org/gitlab-ce/issues/45712)
- Implement a CronWorker (Archive leftover periodically)
- Enable live trace on staging.gitlab.com
Those items have been moved to the next iteration https://gitlab.com/gitlab-org/gitlab-ce/issues/47125
Enable live trace on gitlab.com for a short time (10-20 minutes)
`ci_enable_live_trace` feature flag (default: off)
Monitor the load (Grafana, Sentry, etc.)
Verify the data correctness if possible
Enable live trace on gitlab.com by default