Enable Praefect's read distribution feature flag
Production Change
Change Component | Description |
---|---|
Change Objective | Verify that the new read distribution feature, gated behind a feature flag, works correctly |
Change Type | DeploymentNewFeature |
Services Impacted | Praefect |
Change Technician | Pavlo Strokov @8bitlife |
Change Criticality | C4 |
Change Type | changeunscheduled |
Change Reviewer | bjk@gitlab.com |
Due Date | Date and time in UTC timezone for the execution of the change |
Time tracking | 5 min |
Downtime Component | no |
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 2
- Enable the feature flag for 5% of requests: `/chatops run feature set gitaly_distributed_reads 5`
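For intuition about what the trailing `5` means: assuming the chatops numeric argument configures a percentage-of-time gate (each flag check independently returns true with the given probability), roughly 5% of requests should take the new read distribution path. The following is a minimal Python simulation of such a gate; `flag_enabled` and `simulate` are hypothetical helpers, not GitLab's or Gitaly's actual implementation.

```python
import random


def flag_enabled(percentage: float, rng: random.Random) -> bool:
    """Hypothetical percentage-of-time gate: True with probability percentage/100."""
    return rng.random() * 100 < percentage


def simulate(percentage: float, requests: int, seed: int = 42) -> float:
    """Return the observed fraction of simulated requests that saw the flag enabled."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    enabled = sum(flag_enabled(percentage, rng) for _ in range(requests))
    return enabled / requests


if __name__ == "__main__":
    share = simulate(5, 100_000)
    print(f"flag enabled for {share:.2%} of simulated requests")
```

Because every check is an independent draw, the same repository can be served by the old and new code paths on consecutive requests; only the aggregate share is expected to be near 5%.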
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 15
- Verify the error rate for Praefect: https://dashboards.gitlab.net/d/praefect-main/praefect-overview?viewPanel=5&orgId=1
- Verify that read distribution is working using the new Grafana panel that shows `sum(rate(gitaly_praefect_read_distribution{environment="gprd"}[5m])) by (virtual_storage, storage)`
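The panel's query yields one per-second read rate per `(virtual_storage, storage)` pair. A minimal sketch, assuming the same expression is run as an instant query against the Prometheus HTTP API (`/api/v1/query`), for turning the JSON response into each storage's share of reads within its virtual storage. The helper name `read_shares`, the storage names, and the sample values are illustrative, not production data. If distribution is working, more than one `storage` under the same `virtual_storage` should show a non-zero share.

```python
import json
from collections import defaultdict


def read_shares(response_body: str) -> dict:
    """Map virtual_storage -> {storage: fraction of that virtual storage's read rate}."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    rates = defaultdict(dict)
    for series in payload["data"]["result"]:
        labels = series["metric"]
        # Instant-vector samples are [unix_timestamp, "value-as-string"].
        rates[labels["virtual_storage"]][labels["storage"]] = float(series["value"][1])
    shares = {}
    for virtual_storage, per_storage in rates.items():
        total = sum(per_storage.values())
        # Guard against division by zero when no reads were recorded.
        shares[virtual_storage] = {
            storage: (rate / total if total else 0.0) for storage, rate in per_storage.items()
        }
    return shares


if __name__ == "__main__":
    # Illustrative response shape only; storage names and values are made up.
    sample = json.dumps({
        "status": "success",
        "data": {"resultType": "vector", "result": [
            {"metric": {"virtual_storage": "default", "storage": "gitaly-1"}, "value": [1600000000, "30"]},
            {"metric": {"virtual_storage": "default", "storage": "gitaly-2"}, "value": [1600000000, "10"]},
        ]},
    })
    print(read_shares(sample))
```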
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 2
- `/chatops run feature delete gitaly_distributed_reads`
Monitoring
Key metrics to observe
- Metric: gitlab_service_errors
- Location: https://dashboards.gitlab.net/d/praefect-main/praefect-overview?viewPanel=5&orgId=1
- What changes to this metric should prompt a rollback: an error rate noticeably above the pre-change baseline
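The rollback criterion above is qualitative. One way to make it explicit, sketched under assumptions: compare the post-change error ratio with the pre-change baseline and roll back when it exceeds the baseline by a chosen factor. The helpers `error_ratio` and `should_roll_back`, the `tolerance` and `floor` values, and the request counts are hypothetical placeholders, not thresholds agreed for this change.

```python
def error_ratio(errors: float, requests: float) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def should_roll_back(baseline_ratio: float, current_ratio: float,
                     tolerance: float = 2.0, floor: float = 0.001) -> bool:
    """Hypothetical rule: roll back if the current error ratio exceeds
    max(baseline * tolerance, floor). The floor keeps a near-zero baseline
    from turning a handful of errors into a rollback."""
    limit = max(baseline_ratio * tolerance, floor)
    return current_ratio > limit


if __name__ == "__main__":
    before = error_ratio(errors=12, requests=100_000)  # illustrative numbers only
    after = error_ratio(errors=300, requests=100_000)
    print("roll back" if should_roll_back(before, after) else "keep flag enabled")
```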
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change resize any existing compute instances?
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.