Enable new live trace architecture on production for a short period and measure the performance impacts

History

Third evaluation

Disabled on 2019-07-16 13:24 UTC because some traces are missing (see https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4667#note_192334395).

Second evaluation


First evaluation

We shipped the new live trace architecture in %11.0. This feature must be enabled when we move from the monolith to GKE, because we can't store production data on local file storage. However, it is not enabled on production yet because it had a few performance concerns at that time.

In %11.1, we improved the feature significantly and resolved all of those concerns. In addition, it has been evaluated on dev.gitlab.org and staging.gitlab.com for 2 months. So far there have been no problems and it is running steadily.

Now it's time to enable this feature on production. For this first evaluation, we'll enable it for a short period (e.g. 1 hour) and measure its performance impact.

How to enable new live trace architecture

To enable the feature, we flip the feature flag via Feature.enable('ci_enable_live_trace').

During this period, we observe related metrics and crash reports through Grafana, Sentry, and Kibana.

After the metrics collection is done, we'll disable the feature via Feature.disable('ci_enable_live_trace') and discuss whether we need any further improvements.
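
The enable, observe, disable sequence could be wrapped so that the flag is always switched off again, even if something goes wrong during the observation window. The following is only a sketch, not a procedure that was actually used: `with_feature_enabled` is a hypothetical helper, and the small `Feature` module is a stand-in defined only so the snippet runs outside a Rails console (inside the console the real `Feature` class would be used instead).

```ruby
# Stand-in for GitLab's Feature class so this sketch runs on its own.
# Assumption: in a real Rails console, Feature is already defined and
# this definition is skipped.
unless defined?(Feature)
  module Feature
    @flags = {}

    def self.enable(name)
      @flags[name.to_s] = true
    end

    def self.disable(name)
      @flags[name.to_s] = false
    end

    def self.enabled?(name)
      @flags.fetch(name.to_s, false)
    end
  end
end

# Hypothetical helper: enable the flag, run the given block (the
# observation window), and always disable the flag afterwards,
# even if the block raises.
def with_feature_enabled(flag)
  Feature.enable(flag)
  yield
ensure
  Feature.disable(flag)
end
```

A one-hour evaluation would then be a call such as `with_feature_enabled('ci_enable_live_trace') { sleep(3600) }`, with the dashboards watched while the block is waiting. The `ensure` clause is the point of the helper: the flag does not stay on by accident if the console session is interrupted with an exception.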

Metrics to look at when enabled


The definition of DONE for this issue is having this feature enabled for a week without problems or performance degradation.

  • Enable this feature for 1 hour and confirm that it causes no problems or performance degradation
  • Enable this feature for 1 day and confirm that it causes no problems or performance degradation
  • Enable this feature for 1 week and confirm that it causes no problems or performance degradation

A separate issue was created for project board tracking purposes - gitlab-org/gitlab#217988 (closed)

Edited by Craig Gomes