GitLab.com / GitLab Infrastructure Team / production / Issues / #2812
Issue created Oct 12, 2020 by Grzegorz Bizon (@grzesiek, Developer). 3 of 13 checklist items completed.

Cloud native build logs - 10% gitlab.com rollout

Production Change

Change Summary

We are rolling out the cloud native build logs feature to production.

  • This effort is described in an architectural blueprint ➡ https://docs.gitlab.com/ee/architecture/blueprints/cloud_native_build_logs/
  • Rollout issue ➡ gitlab-org/gitlab#241471 (closed)

Notes:

  • This feature was successfully enabled in gitlab-org/gitlab a few weeks ago and is working fine there
  • This feature was also successfully enabled in the gitlab-com/www-gitlab-com project and behaves well there

Change Details

  1. Services Impacted - Redis, APIs
  2. Change Technician - @grzesiek
  3. Change Criticality - C2
  4. Change Type - changeunscheduled
  5. Change Reviewer - TBD
  6. Due Date - 2020-10-12
  7. Time tracking - 1 hour
  8. Downtime Component - No

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 1 minute

  • /chatops run feature set ci_enable_live_trace 5 --actors
  • /chatops run feature set ci_enable_live_trace 10 --actors
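
The two commands above raise the share of actors that get the feature, first to 5% and then to 10%. As a rough illustration of why such a staged rollout can be stable (an actor enabled at 5% stays enabled at 10%), here is a minimal Python sketch of a percentage-of-actors gate. The CRC32-based bucketing, the `actor_enabled` helper, and the `project:<id>` actor identifiers are assumptions made for illustration; they are not confirmed to match how GitLab evaluates `ci_enable_live_trace`.

```python
import zlib


def actor_enabled(feature: str, actor_id: str, percentage: int) -> bool:
    """Return True if the actor falls inside the rollout percentage.

    Hypothetical scheme: hash the feature name plus the actor id into a
    bucket in the range 0..99 and enable the actor if the bucket is below
    the configured percentage.
    """
    bucket = zlib.crc32(f"{feature}{actor_id}".encode("utf-8")) % 100
    return bucket < percentage


if __name__ == "__main__":
    # Hypothetical actor identifiers, used only to exercise the gate.
    actors = [f"project:{i}" for i in range(10_000)]
    at_5 = {a for a in actors if actor_enabled("ci_enable_live_trace", a, 5)}
    at_10 = {a for a in actors if actor_enabled("ci_enable_live_trace", a, 10)}
    # Because the bucket is deterministic, the 5% set is contained in the 10% set.
    print(len(at_5), len(at_10), at_5 <= at_10)
```

Because the bucket depends only on the feature name and the actor, raising the percentage only adds actors, which keeps behavior consistent for jobs that were already using the new trace path.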

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 30 minutes

  • Check correctness of build traces using gitlab_ci_trace_operations_total Prometheus metric

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 1 minute

  • /chatops run feature set ci_enable_live_trace false
  • /chatops run feature set ci_enable_live_trace 0 --actors

Monitoring

  • Sentry errors containing "trace" keyword -> https://sentry.gitlab.net/gitlab/gitlabcom/?query=trace
  • API dashboard for build status and trace operations -> PUT /api/jobs/:id and PATCH /api/jobs/:id/trace
  • Build details page -> GET trace.json and GET raw
  • Redis memory -> Redis Overview Dashboard

Key metrics to observe

  • Metric: gitlab_ci_trace_operations_total
    • Location: Prometheus
    • What changes to this metric should prompt a rollback: a large number of traces counted as invalid.
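
The rollback criterion above is qualitative. One way to make it checkable is to compare two snapshots of the counter and look at the share of operations counted as invalid in between. The sketch below is a hypothetical helper, not part of the change plan: it assumes the counter is broken down by an `operation` label that includes an `invalid` value, and the 1% threshold is a placeholder rather than a figure from this issue.

```python
from typing import Mapping


def invalid_share(before: Mapping[str, float], after: Mapping[str, float]) -> float:
    """Share of trace operations counted as invalid between two snapshots.

    Each mapping goes from an operation name to the counter value at the
    time of the snapshot. Counter resets are not handled in this sketch.
    """
    deltas = {op: after[op] - before.get(op, 0.0) for op in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0
    return deltas.get("invalid", 0.0) / total


def should_roll_back(before: Mapping[str, float],
                     after: Mapping[str, float],
                     threshold: float = 0.01) -> bool:
    """Return True if the invalid share exceeds the placeholder threshold."""
    return invalid_share(before, after) > threshold


if __name__ == "__main__":
    # Made-up snapshot values, for illustration only.
    before = {"appended": 1000.0, "invalid": 0.0}
    after = {"appended": 2000.0, "invalid": 50.0}
    print(should_roll_back(before, after))
```

Looking at the increase between snapshots, rather than the raw counter, keeps invalid traces recorded before the rollout from influencing the decision.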

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
  • There are currently no active incidents.
Edited Oct 12, 2020 by Grzegorz Bizon