Performance improvements of new live trace architecture
We shipped the new live trace architecture as GA in %11.0. We plan to improve this architecture to make it durable enough for production (the number of running jobs is around 1000 ~ 1500).
Changes
- Use ObjectStorage for new CI Job live-trace architecture
  This is necessary if it turns out that the DB load is unacceptable on production
Test, monitor, and metrics
- Fundamental tests on dev.gitlab.org
  - Verify the data correctness. We can compare 1) the incoming full trace from runners with 2) the persisted live trace.
- Failover simulation
  - Redis outage -> Runners recover the partial traces
- Enable live trace on gitlab.com for a short time (10~20 minutes)
  - Enable the ci_enable_live_trace feature flag (default: off)
  - Monitor the load (Grafana, Sentry, etc.)
  - Verify the data correctness if possible
  - Check the performance of rescue_live_trace_worker (especially the performance of the SQL query)
  - Check Kibana for any occurrences of "Failed to archive stale live trace"
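The data correctness check in the plan above (comparing the full trace sent by a runner with the persisted live trace) could be sketched roughly as follows. This is only an illustration, not the GitLab implementation: it assumes the persisted trace can be read back as (chunk index, bytes) pairs, and the helper names `reassemble` and `first_mismatch` are hypothetical.

```python
import hashlib
from typing import Iterable, Optional, Tuple


def reassemble(chunks: Iterable[Tuple[int, bytes]]) -> bytes:
    """Join (chunk_index, data) pairs into one trace, ordered by index.

    Assumption: the persisted live trace is stored as indexed chunks.
    """
    return b"".join(data for _, data in sorted(chunks, key=lambda c: c[0]))


def first_mismatch(expected: bytes, actual: bytes) -> Optional[int]:
    """Return the byte offset of the first difference, or None if identical."""
    # Cheap equality check first: identical digests mean identical traces.
    if hashlib.sha256(expected).digest() == hashlib.sha256(actual).digest():
        return None
    for offset, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            return offset
    # One trace is a strict prefix of the other (e.g. a truncated tail).
    return min(len(expected), len(actual))


if __name__ == "__main__":
    full_trace = b"line 1\nline 2\nline 3\n"
    persisted = [(1, b"line 2\n"), (0, b"line 1\n"), (2, b"line 3\n")]
    print(first_mismatch(full_trace, reassemble(persisted)))
```

Reporting the offset of the first difference, rather than only a pass/fail result, may make it easier to tell a lost chunk (mismatch at a chunk boundary or at the tail) from corrupted content inside a chunk.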
Edited by Shinya Maeda