
Performance improvements of new live trace architecture

We shipped the new live trace architecture as GA in %11.0. We plan to improve this architecture to make it durable enough for production, where the number of running jobs is around 1000 to 1500.

Changes

Tests, monitoring, and metrics

  • Fundamental tests on dev.gitlab.org
    • Verify data correctness by comparing 1) the incoming full trace from runners with 2) the persisted live trace.
  • Failover simulation
    • Redis outage -> Runners recover the partial traces
  • Enable live trace on gitlab.com for a short time (10 to 20 minutes)
    • Enable ci_enable_live_trace feature flag (default: off)
    • Monitor the load (Grafana, Sentry, etc.)
    • Verify the data correctness if possible
    • Check the performance of rescue_live_trace_worker (especially the performance of its SQL query)
    • Check Kibana for any "Failed to archive stale live trace" messages.
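
The data correctness check in the list above could be sketched roughly as follows. This is only an illustration in Python, not the actual GitLab implementation: it assumes the full trace sent by the runner and the persisted live-trace chunks have both already been exported as byte strings, and the helper names (assemble_chunks, traces_match, first_mismatch) are hypothetical.

```python
import hashlib
from typing import Iterable, Optional


def assemble_chunks(chunks: Iterable[bytes]) -> bytes:
    """Concatenate persisted live-trace chunks in the order given.

    Assumption: the caller has already sorted the chunks by their index.
    """
    return b"".join(chunks)


def traces_match(full_trace: bytes, chunks: Iterable[bytes]) -> bool:
    """Compare the runner's full trace with the reassembled live trace by digest."""
    persisted = assemble_chunks(chunks)
    return hashlib.sha256(full_trace).digest() == hashlib.sha256(persisted).digest()


def first_mismatch(expected: bytes, actual: bytes) -> Optional[int]:
    """Return the first byte offset where the traces differ, or None if identical."""
    for offset, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            return offset
    if len(expected) != len(actual):
        # One trace is a strict prefix of the other (e.g. a truncated trace).
        return min(len(expected), len(actual))
    return None


if __name__ == "__main__":
    full = b"line 1\nline 2\nline 3\n"
    chunks = [b"line 1\n", b"line 2\nline 3\n"]
    print(traces_match(full, chunks))
    print(first_mismatch(full, assemble_chunks(chunks)))
```

Comparing digests keeps the pass/fail check cheap when many traces are verified, while the offset helper is only needed to locate where a failing trace diverges, for example to see whether data was lost at a chunk boundary.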
Edited by Shinya Maeda