Testing CI traffic going back to Main in preparation for a possible CI decomposition rollback
Production Change
Change Summary
As part of our upcoming CI decomposition event we want to be ready with a safe and tested rollback procedure. The rollback plan is currently gitlab-org/gitlab#361759 (closed), but there are risks in moving more read-only traffic back to the Main cluster. We are especially concerned about scrambling to do this for the first time on a weekend, when we have no other options. As such, we want to test it safely on production ahead of time, where it can be easily reverted in case of an issue.
So we will test this by:

- Updating replicas in the Patroni Main cluster to advertise as `ci-db-replica` (via Consul configuration) as well as `db-replica`, so they will receive some CI traffic
- Updating replicas in the Patroni CI cluster to not advertise as `ci-db-replica` (via Consul configuration), so that they do not receive any CI read-only traffic and it all goes to Main
- Waiting 1-2 weeks
- Reverting the first 2 steps, so we're back to read-only queries going to `ci-db-replica` and we've confirmed that we can safely roll back reads to the Main cluster without overloading it
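While the test is active, the dual-advertisement state can be spot-checked from Consul DNS: every host advertising `db-replica` should also appear under `ci-db-replica`. A minimal sketch of that comparison, using hard-coded hypothetical hostnames in place of the real `dig db-replica.service.consul +short SRV` output:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical SRV targets; in production these come from
#   dig db-replica.service.consul +short SRV
#   dig ci-db-replica.service.consul +short SRV
db_replica_hosts='patroni-main-01
patroni-main-02'
ci_db_replica_hosts='patroni-ci-01
patroni-main-01
patroni-main-02'

# Hosts advertised as db-replica but NOT as ci-db-replica
# (should be empty while this change is active)
missing=$(comm -23 <(sort <<<"$db_replica_hosts") <(sort <<<"$ci_db_replica_hosts"))

if [ -z "$missing" ]; then
  echo "OK: all Main replicas also advertise ci-db-replica"
else
  echo "Missing from ci-db-replica: $missing"
fi
```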
The overall change request will be open for ~2 weeks, but we'll only mark the change ~change::in-progress while we're actively making changes. When the change request is finished, the production configuration will be back where it started, as this change request reverts itself.
Change Details
- Services Impacted - ~Service::Patroni ~Service::Postgres
- Change Technician - @gsgl @rhenchen.gitlab
- Change Reviewer - @Finotto
- Time tracking - 2 weeks
- Downtime Component - None
Detailed steps for the change
Change Steps - steps to take to execute the change (2 weeks mostly waiting, 1 hour active time)
Estimated Time to Complete (mins) - ~2 weeks elapsed (≈1 hour of active work)
- Set label ~change::in-progress: `/label ~change::in-progress`
- Add `"consul.additional_service_names": ['ci-db-replica']` to Patroni Main
  - MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1874
  - Wait until the change has been deployed to the Patroni Main DB instances (or trigger a `chef-client` run with `knife`)
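Once that attribute is deployed, Chef renders an extra Consul service definition into `/etc/consul/conf.d/` on each Main replica, which is why those files have to be cleaned up later. A hypothetical sketch of what such a `ci-db-replica.json` service definition could look like (field values are illustrative, not taken from production):

```json
{
  "service": {
    "name": "ci-db-replica",
    "port": 5432,
    "checks": [
      { "args": ["/usr/local/bin/patroni-replica-check"], "interval": "10s" }
    ]
  }
}
```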
- Observe that CI reads are now going to Patroni Main as well
- Wait 1 working day
- Change `"consul.service_name": "dormant-ci-db-replica"` on Patroni CI
  - Once Chef has run on the Patroni CI hosts you'll need to explicitly delete the `/etc/consul/conf.d/ci-db-replica*.json` files, as Chef does not clean these up properly:
    - `knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo rm -f /etc/consul/conf.d/ci-db-replica*.json'`
    - `knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo consul reload'`
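The effect of those two `knife` commands on each host can be sketched locally. This is an illustrative stand-in only: a temp directory plays the role of `/etc/consul/conf.d`, and the `consul reload` is echoed rather than executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Temp directory standing in for /etc/consul/conf.d on a Patroni CI host
conf_d=$(mktemp -d)
touch "$conf_d/ci-db-replica.json" "$conf_d/db-replica.json"

# Chef rewrites the service definition but leaves the old
# ci-db-replica*.json files behind, so remove them explicitly...
rm -f "$conf_d"/ci-db-replica*.json

# ...and only then ask Consul to re-read conf.d
# (on the real hosts: sudo consul reload)
echo "consul reload (simulated)"

remaining=$(ls "$conf_d")
echo "remaining: $remaining"  # only db-replica.json should be left
```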
- Observe that no reads are going to Patroni CI (except baseline exporter reads)
- Wait 1 week
- Revert the MR for Patroni CI (important: you must ensure this is deployed before changing Patroni Main)
  - MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1890
  - Ensure it is deployed
  - `knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo rm -f /etc/consul/conf.d/dormant-ci-db-replica*.json'`
  - `knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo consul reload'`
- Confirm that all patroni + patroni-ci hosts resolve for `ci-db-replica.service.consul`
  - `dig ci-db-replica.service.consul +short SRV | sort -k 4`
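The `dig` confirmation above can be tallied per cluster: while this change is active, both `patroni-main` and `patroni-ci` targets should appear, and after the final revert only `patroni-ci` ones. A sketch using hard-coded hypothetical SRV lines in the `+short SRV` format (`priority weight port target`):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical output of:
#   dig ci-db-replica.service.consul +short SRV | sort -k 4
srv='1 1 5432 patroni-ci-01.node.gprd.consul.
1 1 5432 patroni-ci-02.node.gprd.consul.
1 1 5432 patroni-main-01.node.gprd.consul.'

# Count SRV targets per cluster
main_count=$(grep -c 'patroni-main' <<<"$srv" || true)
ci_count=$(grep -c 'patroni-ci' <<<"$srv" || true)
echo "main=$main_count ci=$ci_count"
```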
- Observe that CI reads are going to Main + CI
- Revert the MR for Patroni Main
  - Once the MR has been deployed, run `chef-client` on the Patroni Main hosts:
  - `knife ssh -C 5 'roles:gprd-base-db-patroni-main' 'sudo chef-client'`
- Confirm that `ci-db-replica.service.consul` resolves only to patroni-ci hosts:
  - `dig ci-db-replica.service.consul +short SRV | sort -k 4`
- Observe that CI reads are going to Patroni CI only
- Set label ~change::complete: `/label ~change::complete`
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10 mins
- Roll back the MRs merged above
- To speed up the rollback, run Chef on the affected Patroni CI and Patroni Main hosts
- Once Chef has run on the Patroni CI hosts you'll need to explicitly delete the `/etc/consul/conf.d/dormant-ci-db-replica*.json` files, as Chef does not clean these up properly
- Set label ~change::aborted: `/label ~change::aborted`
Monitoring
Key metrics to observe
- Metric: Sentry Errors
- Location: https://sentry.gitlab.net/gitlab/gitlabcom/
- What changes to this metric should prompt a rollback: New errors likely related to this change (timing and related to database connections)
- Metric: Patroni CI Dashboard
- Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1
- What changes to this metric should prompt a rollback: High Error Ratio or Saturation
- Metric: Logs and Prometheus metrics
- Location:
  - Prometheus (Thanos): https://thanos-query.ops.gitlab.net/graph?g0.expr=sum(irate(gitlab_database_decomposition_gitlab_schemas_used%7Benv%3D%22gprd%22%7D%5B10m%5D))%20by(env%2Ctype%2Cdb_config_name%2Cgitlab_schemas)&g0.tab=0&g0.stacked=1&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  - Logs (Kibana): https://log.gprd.gitlab.net/goto/66166510-a516-11ec-bd7b-c108343628c3
- What changes to this metric should prompt a rollback: Unexplained queries going to Main or CI.
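The Thanos link above charts `gitlab_database_decomposition_gitlab_schemas_used` summed by `db_config_name` and `gitlab_schemas`. The same check can be scripted against the Prometheus HTTP API; the response below is a fabricated sample of the `/api/v1/query` result shape, shown only to illustrate the `jq` filtering (a real query would hit `https://thanos-query.ops.gitlab.net/api/v1/query`):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fabricated sample /api/v1/query response for:
#   sum(irate(gitlab_database_decomposition_gitlab_schemas_used{env="gprd"}[10m]))
#     by (env, type, db_config_name, gitlab_schemas)
resp='{"data":{"result":[
  {"metric":{"db_config_name":"ci","gitlab_schemas":"gitlab_ci"},"value":[0,"120.5"]},
  {"metric":{"db_config_name":"main","gitlab_schemas":"gitlab_ci"},"value":[0,"3.2"]}
]}}'

# Show where gitlab_ci reads are landing: during the test window they
# should go to "main", and after the revert back to "ci".
jq -r '.data.result[]
       | select(.metric.gitlab_schemas == "gitlab_ci")
       | "\(.metric.db_config_name) \(.value[1])"' <<<"$resp"
```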
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.