Enable new live trace architecture on production for a short period and measure the performance impacts
History
- 2019-10-23 13:04 UTC: This feature has been disabled because of a Redis OOM incident.
- 2019-08-16 13:14 UTC: This feature has been enabled on gitlab-org/gitlab-ce, gitlab-org/gitlab-ee and gitlab-com/www-gitlab-com to evaluate a patch.
- 2019-07-16 13:24 UTC: This feature has been disabled to investigate trace loss.
- 2019-07-12 17:15 UTC: This feature has been enabled for the third evaluation.
- 2019-04-18 07:13 UTC: This feature has been disabled to investigate trace loss.
- 2019-01-14 04:51 UTC: This feature has been enabled for the second evaluation.
Third evaluation
- Reason: https://gitlab.com/gitlab-org/gitlab-ce/issues/60678#note_185197197
- Date: Jun 2019
- Enabling this feature for 1 week, and confirming it doesn't cause any problems/performance degradation
Disabled on 2019-07-16 13:24 UTC because some traces are missing: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4667#note_192334395
Second evaluation
- Reason: https://gitlab.com/gitlab-org/gitlab-ce/issues/51502
- Date: Jan 2019
- Enabling this feature for 1 week, and confirming it doesn't cause any problems/performance degradation
First evaluation
We shipped the feature, the new live trace architecture, in %11.0. This feature has to be enabled when we move from the monolith to GKE (because we can't store production data in local file storage). However, it has not been enabled on production yet because there were a few performance concerns at that time.
In %11.1, we improved the feature significantly and resolved all concerns. In addition, it has been evaluated on dev.gitlab.org and staging.gitlab.com for 2 months. So far there have been no problems, and it is running steadily.
Now it's time to enable this feature on production. For this first evaluation, we'll enable the feature for a short period (e.g. 1 hour) and measure its performance impact.
How to enable new live trace architecture
To enable the feature, we flip the feature flag via Feature.enable('ci_enable_live_trace').
During the period, we observe related metrics/crash reports through Grafana/Sentry/Kibana.
After the metrics collection is done, we'll disable the feature via Feature.disable('ci_enable_live_trace') and discuss whether we need any further improvements.
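The enable, observe, disable sequence described here can be sketched as a small Ruby helper. This is only an illustration under stated assumptions: the `Feature` module below is a stand-in so the snippet runs outside a GitLab Rails console (in a real console, GitLab's own `Feature` would be used instead), and `with_live_trace_enabled` is a hypothetical helper name, not something that exists in GitLab.

```ruby
# Stand-in for GitLab's Feature API so this sketch is runnable on its own.
# ASSUMPTION: in a real Rails console, remove this module and rely on GitLab's Feature.
module Feature
  @flags = {}

  def self.enable(name)
    @flags[name.to_s] = true
  end

  def self.disable(name)
    @flags[name.to_s] = false
  end

  def self.enabled?(name)
    @flags.fetch(name.to_s, false)
  end
end

# Hypothetical helper: enable the flag, run the observation block,
# and always disable the flag afterwards, even if the block raises.
def with_live_trace_enabled(flag = 'ci_enable_live_trace')
  Feature.enable(flag)
  yield
ensure
  Feature.disable(flag)
end

enabled_during_window = nil
with_live_trace_enabled do
  # Placeholder for the observation window (e.g. 1 hour watching the dashboards).
  enabled_during_window = Feature.enabled?('ci_enable_live_trace')
end

puts "enabled during window: #{enabled_during_window}"
puts "enabled afterwards: #{Feature.enabled?('ci_enable_live_trace')}"
```

The `ensure` clause is the design point of the sketch: the flag is switched off again even if something fails during the observation window, which matches the intent of a short, bounded evaluation.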
Metrics to look at when enabled
- https://dashboards.gitlab.net/d/000000126/grape-endpoints?orgId=1&var-action=Grape%23PATCH%20%2Fapi%2Fjobs%2F:id%2Ftrace&var-database=influxdb-01-inf-gprd
- https://dashboards.gitlab.net/d/000000126/grape-endpoints?orgId=1&var-action=Grape%23PUT%20%2Fapi%2Fjobs%2F:id&var-database=influxdb-01-inf-gprd
- https://dashboards.gitlab.net/d/thYzurImk/rails-controllers?orgId=1&var-action=Projects::JobsController%23trace.json&var-database=influxdb-01-inf-gprd
- https://dashboards.gitlab.net/d/000000124/sidekiq-workers?orgId=1&var-worker=Ci::BuildTraceChunkFlushWorker%23perform&var-database=influxdb-01-inf-gprd
- https://dashboards.gitlab.net/d/000000124/sidekiq-workers?orgId=1&var-worker=ArchiveTraceWorker%23perform&var-database=influxdb-01-inf-gprd
- https://sentry.gitlab.net/gitlab/gitlabcom/?query=trace
The definition of DONE in this issue is enabling this feature for a week without having problems/performance degradation.
- Enabling this feature for 1 hour, and confirming it doesn't cause any problems/performance degradation
- Enabling this feature for 1 day, and confirming it doesn't cause any problems/performance degradation
- Enabling this feature for 1 week, and confirming it doesn't cause any problems/performance degradation
A separate issue was created for project board tracking purposes - gitlab-org/gitlab#217988 (closed)