[GPRD] - Decomposition Phase 3 Serve CI reads from CI standby cluster
Production Change
Change Summary
In Phase 3, only read traffic for CI data will be served from the CI database. We require a way to share a primary write connection while using a separate read replica.
Previous equivalent rollout for staging was handled at gitlab-org/gitlab#345118 (closed) and most of the steps in this issue are copied from gitlab-org/gitlab#351568 (closed) and converted into change request format.
Change Details
- Services Impacted - ServicePostgres ServiceAPI ServiceWeb ServiceSidekiq
- Change Technician - @DylanGriffith @tkuah
- Change Reviewer - @tkuah
- Time tracking - 1 week
- Downtime Component - 0
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 30 minutes
- Set label changein-progress on this issue
- Set the following environment variables for all gitlab-rails hosts:
- GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci=main to make main: and ci: share the same primary connection
- GITLAB_MULTIPLE_DATABASE_METRICS=true to enable db_config_name in Prometheus metrics to indicate the database used
- MR:
- previous staging MRs:
- https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/968
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1377 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1378 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1381 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1364 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1358 (merged)
- https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/985
- New MR:
- previous staging MRs:
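The effect of GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci=main can be illustrated with a minimal Ruby sketch. This is not GitLab's implementation: the helper names reuse_primary_map and primary_for are hypothetical, and the sketch only assumes that the suffix after the prefix names a database and the value names the database whose primary connection it reuses.

```ruby
# Hypothetical illustration only, not GitLab's actual code.
REUSE_PRIMARY_PREFIX = 'GITLAB_LOAD_BALANCING_REUSE_PRIMARY_'

# Collect every <prefix><db>=<target> variable into a { db => target } map.
def reuse_primary_map(env)
  env.each_with_object({}) do |(key, value), map|
    next unless key.start_with?(REUSE_PRIMARY_PREFIX)
    next if value.nil? || value.empty?

    map[key.delete_prefix(REUSE_PRIMARY_PREFIX)] = value
  end
end

# Name of the database whose primary connection serves writes for db_name.
# Databases without a reuse variable keep their own primary.
def primary_for(db_name, env)
  reuse_primary_map(env).fetch(db_name, db_name)
end

example_env = {
  'GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci' => 'main',
  'GITLAB_MULTIPLE_DATABASE_METRICS' => 'true'
}

puts primary_for('ci', example_env)   # ci writes share main's primary
puts primary_for('main', example_env) # main keeps its own primary
```

On a real host the same helpers could be called with ENV.to_h in place of example_env.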
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 1 week
- Configure chef for the gitlab-rails console node to set config/database.yml for multiple databases
- MR: (previous staging MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/949)
- Run validation commands on console gitlab-org/gitlab#351568 (closed)
- Configure CNG for remaining gitlab-rails canary nodes
- MR:
- previous staging MRs:
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1377 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1378 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1381 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1364 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1358 (merged)
- https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/985
- Configure CNG for remaining gitlab-rails nodes
- MR:
- previous staging MRs:
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1377 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1378 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1381 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1364 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!1358 (merged)
- https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/985
- New MR:
- Confirm monitoring expectations from gitlab-org/gitlab#351568 (closed)
- Enable 0.01% for the use_model_load_balancing FF: Feature.enable_percentage_of_time(:use_model_load_balancing, 0.01)
- Monitor all metrics from gitlab-org/gitlab#351568 (closed)
- Enable 1% for use_model_load_balancing: /chatops run feature set use_model_load_balancing 1 --random
- Monitor all metrics from gitlab-org/gitlab#351568 (closed)
- Enable 10% for use_model_load_balancing: /chatops run feature set use_model_load_balancing 10 --random
- Monitor all metrics from gitlab-org/gitlab#351568 (closed)
- Enable 20% for use_model_load_balancing: /chatops run feature set use_model_load_balancing 20 --random
- Monitor all metrics from gitlab-org/gitlab#351568 (closed)
- Enable 50% for use_model_load_balancing: /chatops run feature set use_model_load_balancing 50 --random
- Monitor all metrics from gitlab-org/gitlab#351568 (closed)
- Enable 100% for use_model_load_balancing: /chatops run feature set use_model_load_balancing 100 --random
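The rollout steps above use a percentage-of-time gate, so each check of the flag is an independent random draw and not a fixed set of users or projects. A minimal Ruby sketch of that behaviour, useful for reasoning about how much traffic each step should move, might look like the following. The method names and the seeded random generator are assumptions made for illustration; this is not the feature flag library's code.

```ruby
# Hypothetical simulation of a percentage-of-time gate.
# Each check is enabled with probability percentage / 100.
def enabled_for_this_check?(percentage, rng)
  rng.rand * 100 < percentage
end

# Fraction of simulated flag checks that came back enabled.
def simulated_enabled_share(percentage, checks: 100_000, rng: Random.new(42))
  hits = checks.times.count { enabled_for_this_check?(percentage, rng) }
  hits.to_f / checks
end

# Percentages taken from the change steps above.
[0.01, 1, 10, 20, 50, 100].each do |percentage|
  share = simulated_enabled_share(percentage)
  puts format('%6.2f%% configured: %.4f of checks enabled', percentage, share)
end
```

Because every check is random, a single request may see the flag both on and off, which is why the monitoring steps look at aggregate request volume and not at individual requests.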
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 10
- Monitor all metrics from gitlab-org/gitlab#351568 (closed)
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 30 minutes
- Disable the feature flag use_model_load_balancing
- Only if disabling the feature flag does not work:
  - Revert MRs which set the environment variables and the config/database.yml
Monitoring
Key metrics to observe
- Metric: Sentry Errors
- Location: https://sentry.gitlab.net/gitlab/gitlabcom/
- What changes to this metric should prompt a rollback: New errors that are likely related to this change, judged by their timing and by whether they involve database connections
- Metric: Patroni CI Dashboard
- Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1
- What changes to this metric should prompt a rollback: High Error Ratio or Saturation
- Metric: Logs and Prometheus metrics
- Location: gitlab-org/gitlab#351568 (closed)
- What changes to this metric should prompt a rollback: Unexplained high number of requests to the ci database. We should see growth proportional to the feature flag rollout
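The "proportional growth" expectation for the last metric can be expressed as a simple check. The sketch below is an assumption-laden illustration: the method name, the two request counters, and the 5-point tolerance are all hypothetical, and the real counts would have to come from the logs and Prometheus metrics described in gitlab-org/gitlab#351568.

```ruby
# Hypothetical check: is the share of CI read requests hitting the ci
# database no higher than the rollout percentage plus a tolerance?
def ci_share_within_expectation?(rollout_percentage:, ci_requests:, total_requests:, tolerance: 0.05)
  return true if total_requests.zero? # nothing observed yet

  observed = ci_requests.to_f / total_requests
  expected = rollout_percentage / 100.0
  observed <= expected + tolerance
end

# Made-up counts for illustration only.
puts ci_share_within_expectation?(rollout_percentage: 10, ci_requests: 1_050, total_requests: 10_000)
puts ci_share_within_expectation?(rollout_percentage: 10, ci_requests: 4_000, total_requests: 10_000)
```

The check is one-sided on purpose: the rollback trigger is an unexplained excess of requests to the ci database, not a shortfall.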
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Change Reviewer checklist
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
  - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
  - Dry-run is part of the process. First, the MRs show dry-run output; then we make the changes on the console node and test them there.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed; cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Dylan Griffith (ex GitLab)