[GPRD] - Decomposition Phase 3 Serve CI reads from CI standby cluster

Production Change

Change Summary

In Phase 3, only read traffic for CI data will be served from the CI database. This requires a way to share the primary write connection while using a separate read replica.

The equivalent rollout for staging was handled in gitlab-org/gitlab#345118 (closed). Most of the steps in this issue are copied from gitlab-org/gitlab#351568 (closed) and converted into change request format.

Change Details

  1. Services Impacted - Service::Postgres Service::API Service::Web Service::Sidekiq
  2. Change Technician - @DylanGriffith @tkuah
  3. Change Reviewer - @tkuah
  4. Time tracking - 1 week
  5. Downtime Component - 0

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 30 minutes

  • Set label change::in-progress on this issue
  • Set the following environment variables for all gitlab-rails hosts:
    • GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci=main so that main and ci share the same primary connection
    • GITLAB_MULTIPLE_DATABASE_METRICS=true to enable db_config_name in Prometheus metrics, indicating which database was used
    • MR:
      • previous staging MRs:
        • https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/968
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1377 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1378 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1381 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1364 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1358 (merged)
        • https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/985
      • New MR:
        • Chef-repo https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1529
        • gitlab-com gitlab-com/gl-infra/k8s-workloads/gitlab-com!1589 (merged)
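
The two environment variables above can be checked on a host before moving on to the change steps. The following is a minimal Ruby sketch, assuming the variables are visible in the environment of the process that runs it. EXPECTED_ENV and missing_env are hypothetical names used for illustration only and are not part of GitLab.

```ruby
# Expected values, taken from the pre-change steps in this issue.
EXPECTED_ENV = {
  "GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci" => "main",
  "GITLAB_MULTIPLE_DATABASE_METRICS" => "true"
}.freeze

# Returns the names of variables that are unset or hold a different value.
# `env` defaults to the real process environment; pass a Hash to test.
def missing_env(env = ENV)
  EXPECTED_ENV.reject { |name, value| env[name] == value }.keys
end

problems = missing_env
if problems.empty?
  puts "environment OK"
else
  puts "unset or unexpected: #{problems.join(', ')}"
end
```

An empty result only shows that the variables reached the process. It does not show that the application applied them, so the validation commands in gitlab-org/gitlab#351568 remain the authoritative check.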

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 1 week

  1. Configure Chef for the gitlab-rails console node so that config/database.yml defines multiple databases
    1. MR: (previous staging MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/949)
      • New MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1534
  2. Run validation commands on console gitlab-org/gitlab#351568 (closed)
  3. Configure CNG for remaining gitlab-rails canary nodes
    1. MR:
      • previous staging MRs:
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1377 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1378 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1381 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1364 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1358 (merged)
        • https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/985
      • New MR: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1595 (merged)
  4. Configure CNG for remaining gitlab-rails nodes
    1. MR:
      • previous staging MRs:
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1377 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1378 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1381 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1364 (merged)
        • gitlab-com/gl-infra/k8s-workloads/gitlab-com!1358 (merged)
        • https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/985
      • New MR:
        • gitlab-com gitlab-com/gl-infra/k8s-workloads/gitlab-com!1592 (merged)
  5. Confirm monitoring expectations from gitlab-org/gitlab#351568 (closed)
  6. Enable 0.01% for use_model_load_balancing FF
    1. Feature.enable_percentage_of_time(:use_model_load_balancing, 0.01)
  7. Monitor all metrics from gitlab-org/gitlab#351568 (closed)
  8. Enable 1% for use_model_load_balancing
    1. /chatops run feature set use_model_load_balancing 1 --random
  9. Monitor all metrics from gitlab-org/gitlab#351568 (closed)
  10. Enable 10% for use_model_load_balancing
    1. /chatops run feature set use_model_load_balancing 10 --random
  11. Monitor all metrics from gitlab-org/gitlab#351568 (closed)
  12. Enable 20% for use_model_load_balancing
    1. /chatops run feature set use_model_load_balancing 20 --random
  13. Monitor all metrics from gitlab-org/gitlab#351568 (closed)
  14. Enable 50% for use_model_load_balancing
    1. /chatops run feature set use_model_load_balancing 50 --random
  15. Monitor all metrics from gitlab-org/gitlab#351568 (closed)
  16. Enable 100% for use_model_load_balancing
    1. /chatops run feature set use_model_load_balancing 100 --random
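
The staged percentages above can be reasoned about with a toy model of a percentage-of-time gate. The sketch below is an illustration under stated assumptions: each flag check is modelled as an independent random draw, and gate_open? and enabled_share are hypothetical helpers, not GitLab's feature flag implementation. It shows why the share of checks that take the new path should track the configured percentage at every stage.

```ruby
# Rollout stages, in percent, as listed in the change steps.
ROLLOUT_STAGES = [0.01, 1, 10, 20, 50, 100].freeze

# Toy percentage-of-time gate: each check is an independent draw that is
# true with probability `percentage` / 100. Illustrative model only.
def gate_open?(percentage, rng)
  rng.rand(100.0) < percentage
end

# Fraction of `checks` simulated checks for which the gate was open.
# The seeded generator keeps runs repeatable.
def enabled_share(percentage, checks: 100_000, rng: Random.new(42))
  checks.times.count { gate_open?(percentage, rng) }.to_f / checks
end

ROLLOUT_STAGES.each do |stage|
  puts format("%6.2f%% configured -> %.4f observed share", stage, enabled_share(stage))
end
```

At the 0.01% stage the expected number of hits is small even over many checks, so the first stage is mainly a smoke test for errors rather than a load test.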

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 10

  1. Monitor all metrics from gitlab-org/gitlab#351568 (closed)

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30 minutes

  • Disable the feature flag use_model_load_balancing
  • Only if disabling the feature flag does not work:
    • Revert the MRs that set the environment variables and config/database.yml

Monitoring

Key metrics to observe

  • Metric: Sentry Errors
    • Location: https://sentry.gitlab.net/gitlab/gitlabcom/
    • What changes to this metric should prompt a rollback: New errors likely related to this change (based on timing, or related to database connections)
  • Metric: Patroni CI Dashboard
    • Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1
    • What changes to this metric should prompt a rollback: High Error Ratio or Saturation
  • Metric: Logs and Prometheus metrics
    • Location: gitlab-org/gitlab#351568 (closed)
    • What changes to this metric should prompt a rollback: An unexplained high number of requests to the ci database. Requests should grow in proportion to the feature flag rollout percentage
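
The proportionality expectation for the ci database can be written down as a simple check. This is a minimal sketch, assuming read counts for CI models are available from the logs and Prometheus metrics referenced above. The helper name, its parameters, and the 5% tolerance are all assumptions chosen for illustration, and the counts in the example call are made up.

```ruby
# Returns true when the ci database serves a larger share of CI model reads
# than the current rollout percentage explains.
# `tolerance` is an assumed allowance (as a fraction) for sampling noise.
def unexplained_ci_load?(rollout_percentage:, ci_reads:, total_ci_model_reads:, tolerance: 0.05)
  return false if total_ci_model_reads.zero?

  observed = ci_reads.to_f / total_ci_model_reads
  expected = rollout_percentage / 100.0
  observed > expected + tolerance
end

# Example with made-up counts: 10% rollout, 1,200 of 10,000 reads on ci.
puts unexplained_ci_load?(rollout_percentage: 10, ci_reads: 1_200, total_ci_model_reads: 10_000)
```

A result of true is a prompt to investigate and, if no explanation is found, to follow the rollback steps. It is not a substitute for watching the Patroni CI dashboard for error ratio and saturation.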

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Change Reviewer checklist

C4 C3 C2 C1:

  • The scheduled day and time of execution of the change is appropriate.
  • The change plan is technically accurate.
  • The change plan includes estimated timing values based on previous testing.
  • The change plan includes a viable rollback plan.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  • The change plan includes success measures for all steps/milestones during the execution.
  • The change adequately minimizes risk within the environment/service.
  • The performance implications of executing the change are well-understood and documented.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  • The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
    1. Dry-run is part of the process. First, the MRs show dry-run output; then we make the changes on the console node and test them there.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.
Edited Mar 18, 2022 by Dylan Griffith (ex GitLab)