Test cloud native build logs in `gitlab-org/gitlab` project for 24 hours
Production Change
Change Component | Description |
---|---|
Change Objective | Performance tests of a redesigned feature (cloud native build logs) |
Change Type | Feature Redesign |
Services Impacted | Redis, GitLab CI/CD |
Change Technician | @grzesiek, SRE on call |
Change Criticality | C2 |
Change Type | changescheduled |
Change Reviewer | @bjk-gitlab |
Due Date | 2020-08-12 |
Time tracking | 5 hours |
Downtime Component | No downtime |
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete - 1 hour
- Write a message about the rollout in the #whats-happening-at-gitlab Slack channel 1 hour before we start
Change Steps - steps to take to execute the change
Estimated Time to Complete - 4 hours
- `/chatops run feature set --project=gitlab-org/gitlab ci_enable_live_trace true`
- Monitor Redis Overview metrics for around 4 to 8 hours
- Manage the production change hand-off to another SRE
- Leave the feature flag on for the `gitlab-org/gitlab` project for 24 hours
- `/chatops run feature set --project=gitlab-org/gitlab ci_enable_live_trace false`
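The timing implied by the steps above (announcement 1 hour before, monitoring for 4 to 8 hours, flag off after 24 hours) can be sketched as a small helper. This is only an illustration: the function name `rollout_timeline` and the start time are hypothetical, since the issue gives a due date but no start time.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def rollout_timeline(start: datetime) -> Dict[str, datetime]:
    """Derive the checkpoints of the change from the moment the flag is enabled."""
    return {
        "announce_in_slack": start - timedelta(hours=1),   # pre-change step
        "enable_flag": start,                               # feature set ... true
        "monitoring_earliest_end": start + timedelta(hours=4),
        "monitoring_latest_end": start + timedelta(hours=8),
        "disable_flag": start + timedelta(hours=24),        # feature set ... false
    }


if __name__ == "__main__":
    # Hypothetical start time, chosen only for the example.
    start = datetime(2020, 8, 11, 9, 0, tzinfo=timezone.utc)
    for name, when in rollout_timeline(start).items():
        print(f"{when.isoformat()}  {name}")
```

Working from a single start time keeps the hand-off between SREs unambiguous, because every checkpoint is an absolute timestamp rather than "in a few hours".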
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete - 1 hour
- Take screenshots of metrics and post the results to gitlab-org/gitlab#217988
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete - 10 minutes
- Disable the feature flag: `/chatops run feature set --project=gitlab-org/gitlab ci_enable_live_trace false`
- Optionally, cancel running pipelines in `gitlab-org/gitlab`
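If ChatOps is unavailable during a rollback, the same per-project flag change could in principle be made through the GitLab Features REST API. The sketch below only builds the request and does not send it. It assumes that `POST /features/:name` with `value` and `project` parameters is an acceptable substitute for the ChatOps command, that an administrator token is at hand, and that the instance URL is `https://gitlab.com`. The helper name `build_feature_request` is hypothetical and this is not the procedure used in this change.

```python
import urllib.parse
import urllib.request

GITLAB_API = "https://gitlab.com/api/v4"  # assumed instance URL


def build_feature_request(flag: str, enabled: bool, project: str,
                          token: str) -> urllib.request.Request:
    """Build (but do not send) a Features API call scoped to one project."""
    body = urllib.parse.urlencode({
        "value": "true" if enabled else "false",
        "project": project,  # project path, e.g. gitlab-org/gitlab
    }).encode()
    return urllib.request.Request(
        f"{GITLAB_API}/features/{flag}",
        data=body,
        method="POST",
        headers={"PRIVATE-TOKEN": token},  # requires an administrator token
    )


if __name__ == "__main__":
    # Placeholder token; nothing is sent over the network here.
    req = build_feature_request("ci_enable_live_trace", False,
                                "gitlab-org/gitlab", "<admin-token>")
    print(req.get_method(), req.full_url)
    print(req.data.decode())
    # Sending it would be: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to review exactly what would be changed before touching production.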
Monitoring
Key metrics to observe
- Metric: Sentry errors for build traces
  - Location: Sentry errors
  - What changes to this metric should prompt a rollback: a lot of Redis / Exclusive Lock / Traces related exceptions
- Metric: API endpoint for build logs
  - Location: API for job traces dashboard
  - What changes to this metric should prompt a rollback: noticeable spikes in errors or latency
- Metric: Redis overview
  - Location: Redis overview dashboard
  - What changes to this metric should prompt a rollback: memory consumption too high, CPU saturation
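The rollback criteria above are qualitative. One way to make them explicit for the SRE taking over the hand-off is a small check like the one below. It is a sketch under stated assumptions: every threshold value and field name is hypothetical and would need to be agreed on from the actual dashboards, since the issue defines no numeric limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """One reading of the three key metrics (all field names are illustrative)."""
    trace_exceptions_per_min: float  # Sentry: Redis / exclusive lock / trace exceptions
    api_error_ratio: float           # build logs API: share of failed requests
    api_p95_latency_s: float         # build logs API: 95th percentile latency
    redis_memory_ratio: float        # Redis: used memory / max memory
    redis_cpu_ratio: float           # Redis: CPU utilisation of the primary


@dataclass
class Thresholds:
    """Hypothetical limits; the issue does not specify numbers."""
    trace_exceptions_per_min: float = 10.0
    api_error_ratio: float = 0.01
    api_p95_latency_s: float = 2.0
    redis_memory_ratio: float = 0.8
    redis_cpu_ratio: float = 0.9


def rollback_reasons(s: Snapshot, t: Thresholds = Thresholds()) -> List[str]:
    """Return the list of breached criteria; an empty list means keep going."""
    reasons = []
    if s.trace_exceptions_per_min > t.trace_exceptions_per_min:
        reasons.append("too many trace-related exceptions in Sentry")
    if s.api_error_ratio > t.api_error_ratio:
        reasons.append("error spike on the build logs API")
    if s.api_p95_latency_s > t.api_p95_latency_s:
        reasons.append("latency spike on the build logs API")
    if s.redis_memory_ratio > t.redis_memory_ratio:
        reasons.append("Redis memory consumption too high")
    if s.redis_cpu_ratio > t.redis_cpu_ratio:
        reasons.append("Redis CPU saturation")
    return reasons


if __name__ == "__main__":
    reading = Snapshot(0.0, 0.001, 0.5, 0.95, 0.3)
    print(rollback_reasons(reading) or "no rollback needed")
```

Returning the reasons rather than a bare boolean gives the technician something concrete to paste into the issue when a rollback is triggered.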
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? This change might involve additional Redis usage.
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.
Edited by Grzegorz Bizon