
Testing CI Traffic going back to Main in preparation in case of CI decomposition rollback

Production Change

Change Summary

As part of our upcoming CI decomposition event we want to be ready with a safe and tested rollback procedure. The rollback plan is currently gitlab-org/gitlab#361759 (closed), but there are some risks in moving more read-only traffic back to the Main cluster. We are especially concerned about the risk of scrambling to do this for the first time on the weekend, when we have no other options. As such, we want to test it safely on production ahead of time, where it can be easily reverted in case of an issue.

So we will test this by:

  1. Updating replicas in the Patroni Main cluster to advertise as ci-db-replica (via Consul configuration) in addition to db-replica, so that they receive some CI traffic
  2. Updating replicas in the Patroni CI cluster to stop advertising as ci-db-replica (via Consul configuration), so that they receive no CI read-only traffic and it all goes to Main
  3. Waiting 1-2 weeks
  4. Reverting the first 2 steps so that read-only queries go back to ci-db-replica, having confirmed that we can safely roll back reads to the Main cluster without overloading it
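The traffic shift in steps 1 and 2 can be sanity-checked via Consul DNS. Below is a small helper, a sketch assuming the standard Consul DNS setup on gprd hosts, that extracts the target hostnames from a `dig` SRV lookup so we can see which cluster's replicas currently advertise ci-db-replica:

```shell
# hosts_for_service: given `dig <service> +short SRV` output on stdin,
# print the sorted, de-duplicated target hostnames. The SRV short form is
# "priority weight port target", so the hostname is field 4.
hosts_for_service() {
  awk '{print $4}' | sort -u
}

# Usage on a Consul-enabled gprd host (after step 1 this list should contain
# both Main and CI hosts; after step 2, only Main hosts):
#   dig ci-db-replica.service.consul +short SRV | hosts_for_service
```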

The overall change request will be open for ~2 weeks, but we'll only apply the ~change::in-progress label while we're actively making changes. When the change request is finished, the production configuration will be back where it started, as this change request reverts itself.

Change Details

  1. Services Impacted - ~Service::Patroni ~Service::Postgres
  2. Change Technician - @gsgl
  3. Change Reviewer - @Finotto @rhenchen.gitlab
  4. Time tracking - 2 weeks
  5. Downtime Component - None

Detailed steps for the change

Change Steps - steps to take to execute the change (2 weeks mostly waiting, 1 hour active time)

Estimated Time to Complete (mins) - ~60 minutes of active work, spread over ~2 weeks of waiting

  1. Set label ~change::in-progress: /label ~change::in-progress
  2. Add consul.port_service_name_overrides to Patroni Main
    1. MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1952
    2. Wait until the change has been deployed to Patroni main DB instances (or trigger a chef-client with knife)
  3. Observe that CI reads are going to Patroni Main as well now
  4. Wait 3 working days
  5. Change "consul.service_name": "dormant-ci-db-replica" on Patroni CI
    1. MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1875
  6. Once Chef has run on the Patroni CI hosts, explicitly delete the /etc/consul/conf.d/ci-db-replica*.json files, as Chef does not clean these up properly:
    • knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo rm -f /etc/consul/conf.d/ci-db-replica*.json'
    • knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo consul reload'
  7. Observe that no reads are going to Patroni CI (except baseline exporter reads)
  8. Ensure that we have resolved #7250 (closed). If we are going to rebuild patroni-ci1, we should rebuild it before this step: there is some risk that building patroni-ci1 caused that incident, and building it again may trigger the incident again
  9. Wait 1 week
  10. Revert the MR for Patroni CI (Important: ensure this is deployed before changing Patroni Main)
  11. Confirm that all patroni + patroni-ci hosts resolve for ci-db-replica.service.consul
    • dig ci-db-replica.service.consul +short SRV | sort -k 4
  12. Observe that CI reads are going to main + CI
  13. Revert the MR for Patroni Main
  14. Once the MR has been deployed, run chef-client on the patroni main hosts:
    • knife ssh -C 5 'roles:gprd-base-db-patroni-v12' 'sudo chef-client'
  15. Confirm that ci-db-replica.service.consul resolves only to patroni-ci hosts:
    • dig ci-db-replica.service.consul +short SRV | sort -k 4
  16. Observe that CI reads are going to Patroni CI only
  17. Set label ~change::complete: /label ~change::complete
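Steps 11 and 15 both hinge on which hosts ci-db-replica resolves to. The step 15 check can be scripted as a sketch; the patroni-ci host-naming pattern below is an assumption and should be adjusted to the real gprd hostnames:

```shell
# check_only_ci: read `dig ci-db-replica.service.consul +short SRV` output on
# stdin and succeed only if every target host (field 4 of the SRV record)
# matches the patroni-ci naming pattern (the pattern is an assumption).
check_only_ci() {
  ! awk '{print $4}' | grep -qv '^patroni-ci'
}

# Usage:
#   dig ci-db-replica.service.consul +short SRV | check_only_ci \
#     && echo 'OK: only patroni-ci hosts advertise ci-db-replica'
```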

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 10 mins

  • Rollback the MRs merged above
  • To speed up the rollback, run Chef on the affected Patroni CI and Patroni Main hosts
  • Once Chef has run on the Patroni CI hosts, explicitly delete the /etc/consul/conf.d/dormant-ci-db-replica*.json files, as Chef does not clean these up properly
  • Set label ~change::aborted: /label ~change::aborted
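The rollback steps above can be collapsed into a dry-run helper that prints the knife commands in order, reusing the role names and concurrency values from the forward steps in this issue. This is a sketch; review the output before running anything:

```shell
# rollback_cmds: print the rollback command sequence in order. Inspect the
# output, then run each line by hand (or pipe to bash) once the revert MRs
# have merged.
rollback_cmds() {
  cat <<'EOF'
knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo chef-client'
knife ssh -C 5 'roles:gprd-base-db-patroni-v12' 'sudo chef-client'
knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo rm -f /etc/consul/conf.d/dormant-ci-db-replica*.json'
knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo consul reload'
EOF
}
```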

Monitoring

Key metrics to observe

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are ~severity::1 or ~severity::2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.