Rollout - Cloud Native Build Logs - GitLab.com

Description

This issue describes the incremental rollout strategy we will use to enable Cloud Native Build Logs for everyone on GitLab.com.

Checklist

| Item | State | DRI |
|------|-------|-----|
| Check if merge requests with improvements are in production | | @grzesiek |
| Validate observability metrics for trace rate, mutation, overwrite | | @grzesiek |
| –› Fix build trace rate metric | done in %13.5 | @grzesiek |
| Validate acceptable traces mechanism in `grzesiek/live-traces-sandbox` project | | @grzesiek |
| Validate build status ACCEPTED 202 mechanism using logs and endpoint metrics | | @grzesiek |
| –› Parse CRC32 checksum provided in hexadecimal | done in %13.5 | @grzesiek |
| Validate build status update exponential backoff mechanism | | @grzesiek |
| –› Ensure that runner exponential backoff is an integer | done in %13.5 | @grzesiek |
| Enable cloud native build logs in `gitlab-org/gitlab` project | | @grzesiek |
| –› 👣 gitlab-org/gitlab rollout production change | done in %13.5 | @grzesiek, @ahmadsherif |
| –› Extend exception about chunk data not fulfilled in a bucket | done in %13.5 | @grzesiek |
| –› Make build trace correctness validation sticky | done in %13.5 | @grzesiek |
| –› Retry a build trace chunk migration in case of an exception | done in %13.5 | @grzesiek |
| –› Log detected invalid build trace chunks | done in %13.5 | @grzesiek |
| –› Use optimistic locking to safely migrate a build trace chunk | done in %13.5 | @grzesiek |
| Enable cloud native build logs in `gitlab-com/www-gitlab-com` project | | @grzesiek |
| –› 👣 gitlab-com/www-gitlab-com rollout production change | done in %13.5 | @grzesiek, @igorwwwwwwwwwwwwwwwwwwww |
| Rollout to 5% actors on GitLab.com | | @grzesiek |
| –› 👣 5% rollout production change | done in %13.5 | @grzesiek, @igorwwwwwwwwwwwwwwwwwwww |
| Rollout to 10% actors on GitLab.com | | @grzesiek |
| –› 👣 10% rollout production change | done in %13.5 | @grzesiek, @igorwwwwwwwwwwwwwwwwwwww |
| –› Resolve live trace read race condition using a retry | done in %13.5 | @grzesiek |
| –› Reduce the noise generated by locked chunk migration | done in %13.5 | @grzesiek |
| –› Delay archive trace operation to fix race condition | done in %13.5 | @grzesiek |
| –› Add build trace chunks migration duration histogram metric | done in %13.6 | @grzesiek |
| Rollout to 25% actors on GitLab.com | | @grzesiek |
| –› 👣 25% rollout production change | done in %13.6 | @grzesiek, @hphilipps |
| –› Improve trace finalize histogram buckets | done in %13.6 | @grzesiek |
| –› Fix NoMethodError when chunks are being removed | done in %13.6 | @grzesiek |
| Rollout to 60% actors on GitLab.com | | @grzesiek |
| –› 👣 60% rollout production change | done in %13.6 | @grzesiek, @hphilipps |
| –› Add Grape content logger to log content length and range | done in %13.6 | @grzesiek |
| Rollout to 80% actors on GitLab.com | | @grzesiek |
| –› 👣 80% rollout production change | done in %13.6 | @grzesiek, @hphilipps |
| –› Deduplicate build trace chunks flush worker | done in %13.6 | @nmilojevic1 |
| 100% rollout on GitLab.com | | @grzesiek |
| –› 👣 100% rollout production change | done in %13.6 | @grzesiek |
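
The checklist raises the share of enabled actors in steps (5%, 10%, 25%, 60%, 80%, 100%). The sketch below illustrates one common way such a percentage-of-actors gate can work: each actor is hashed into a stable bucket, so an actor enabled at a lower step stays enabled at every later step. This is an illustration only, not GitLab's actual feature flag implementation; the function name `actor_enabled`, the `project:<id>` actor format, and the CRC32-based bucketing are assumptions made for the example.

```python
import zlib

# Rollout steps taken from the checklist above.
ROLLOUT_STEPS = [5, 10, 25, 60, 80, 100]


def actor_enabled(flag_name: str, actor_id: str, percentage: int) -> bool:
    """Return True if the actor falls inside the enabled percentage.

    Hypothetical helper: the bucket is derived from a CRC32 hash of the
    flag name and the actor identifier, so the result is deterministic.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    # Hashing the flag name together with the actor means that different
    # flags select different subsets of actors.
    bucket = zlib.crc32(f"{flag_name}:{actor_id}".encode("utf-8")) % 100
    return bucket < percentage


if __name__ == "__main__":
    # Count how many of 1000 hypothetical actors are enabled at each step.
    actors = [f"project:{i}" for i in range(1000)]
    for step in ROLLOUT_STEPS:
        enabled = sum(actor_enabled("ci_enable_live_trace", a, step) for a in actors)
        print(f"{step:>3}% step: {enabled} of {len(actors)} actors enabled")
```

Because the bucket depends only on the flag name and the actor, raising the percentage only ever adds actors, which keeps the behaviour for a given project stable between rollout steps.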

Metrics

New metrics exposed in Prometheus:

  • gitlab_ci_trace_operations_total
  • gitlab_ci_trace_rate_bytes

Logs

KQLs:

# PATCH trace
json.meta.project : "grzesiek/live-traces-sandbox" and json.method: "PATCH" and json.route: "/api/:version/jobs/:id/trace"

# PUT job
json.meta.project : "grzesiek/live-traces-sandbox" and json.method: "PUT" and json.route: "/api/:version/jobs/:id"
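
The two queries above differ only in the HTTP method and the route, so the same filter can be generated for other projects while validating a rollout step. The following is a small sketch of that idea; the helper name `kql_filter` is hypothetical, and only the field names `json.meta.project`, `json.method`, and `json.route` come from the queries above.

```python
def kql_filter(project: str, method: str, route: str) -> str:
    """Build a KQL filter string in the same shape as the queries above."""
    clauses = {
        "json.meta.project": project,
        "json.method": method,
        "json.route": route,
    }

    def quote(value: str) -> str:
        # Escape backslashes and double quotes so the value stays one phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    return " and ".join(f"{field}: {quote(value)}" for field, value in clauses.items())


# PATCH trace
PATCH_TRACE = kql_filter(
    "grzesiek/live-traces-sandbox", "PATCH", "/api/:version/jobs/:id/trace"
)

# PUT job
PUT_JOB = kql_filter("grzesiek/live-traces-sandbox", "PUT", "/api/:version/jobs/:id")

if __name__ == "__main__":
    print(PATCH_TRACE)
    print(PUT_JOB)
```

Swapping the first argument for another project path, such as `gitlab-org/gitlab`, yields the equivalent filter for that project.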

Feature Flags

  • ci_enable_live_trace - main feature flag to enable / disable cloud native build logs
  • ci_accept_trace - feature flag for the new mechanism responsible for validating traces
Edited by Grzegorz Bizon