
Testing CI Traffic going back to Main in preparation in case of CI decomposition rollback

Production Change

Change Summary

As part of our upcoming CI decomposition event we want to be ready with a safe and tested rollback procedure. The rollback plan is currently gitlab-org/gitlab#361759 (closed), but there are some risks in moving more read-only traffic back to the Main cluster. We are especially concerned about the risk of scrambling to do this for the first time on the weekend, when we have no other options. As such, we want to test it safely on production ahead of time, where it can be easily reverted in case of an issue.

So we will test this by:

  1. Updating replicas in the Patroni Main cluster to advertise as ci-db-replica (via Consul configuration) in addition to db-replica, so that they receive some CI traffic
  2. Updating replicas in the Patroni CI cluster to stop advertising as ci-db-replica (via Consul configuration), so that they receive no CI read-only traffic and it all goes to Main
  3. Waiting 1-2 weeks
  4. Reverting the first 2 steps so that read-only queries go back to ci-db-replica, having confirmed that we can safely roll back reads to the Main cluster without overloading it
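The traffic shift in steps 1 and 2 can be sanity-checked via Consul DNS. Below is a small helper, a sketch assuming the standard Consul DNS setup on gprd hosts, that extracts the target hostnames from a `dig` SRV lookup so we can see which cluster's replicas currently advertise ci-db-replica:

```shell
# hosts_for_service: given `dig <service> +short SRV` output on stdin,
# print the sorted, de-duplicated target hostnames. The SRV short form is
# "priority weight port target", so the hostname is field 4.
hosts_for_service() {
  awk '{print $4}' | sort -u
}

# Usage on a Consul-enabled gprd host (after step 1 this list should contain
# both Main and CI hosts; after step 2, only Main hosts):
#   dig ci-db-replica.service.consul +short SRV | hosts_for_service
```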

The overall change request will be open for ~2 weeks, but we'll only apply the ~change::in-progress label while we're actively making changes. When the change request is finished, the production configuration will be back where it started, as this change request reverts itself.

Change Details

  1. Services Impacted - ~Service::Patroni ~Service::Postgres
  2. Change Technician - @gsgl
  3. Change Reviewer - @Finotto @rhenchen.gitlab
  4. Time tracking - 2 weeks
  5. Downtime Component - None

Detailed steps for the change

Change Steps - steps to take to execute the change (2 weeks mostly waiting, 1 hour active time)

Estimated Time to Complete (mins) - ~60 minutes of active work, spread over ~2 weeks of waiting

  1. Set label ~change::in-progress: /label ~change::in-progress
  2. Add consul.port_service_name_overrides to Patroni Main
    1. MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1952
    2. Wait until the change has been deployed to Patroni main DB instances (or trigger a chef-client with knife)
  3. Observe that CI reads are going to Patroni Main as well now
  4. Wait 3 working days
  5. Change "consul.service_name": "dormant-ci-db-replica" on Patroni CI
    1. MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1875
  6. Once Chef has run on the Patroni CI hosts, explicitly delete the /etc/consul/conf.d/ci-db-replica*.json files, as Chef does not clean these up properly:
    • knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo rm -f /etc/consul/conf.d/ci-db-replica*.json'
    • knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo consul reload'
  7. Observe that no reads are going to Patroni CI (except baseline exporter reads)
  8. Ensure that we have resolved #7250 (closed). If we are going to rebuild patroni-ci1, we should rebuild it before this step: there is some risk that building patroni-ci1 caused that incident, and building it again may trigger the incident again
  9. Wait 1 week
  10. Revert the MR for Patroni CI (Important: ensure this is deployed before changing Patroni Main)
  11. Confirm that all patroni + patroni-ci hosts resolve for ci-db-replica.service.consul
    • dig ci-db-replica.service.consul +short SRV | sort -k 4
  12. Observe that CI reads are going to main + CI
  13. Revert the MR for Patroni Main
  14. Once the MR has been deployed, run chef-client on the patroni main hosts:
    • knife ssh -C 5 'roles:gprd-base-db-patroni-v12' 'sudo chef-client'
  15. Confirm that ci-db-replica.service.consul resolves only to patroni-ci hosts:
    • dig ci-db-replica.service.consul +short SRV | sort -k 4
  16. Observe that CI reads are going to Patroni CI only
  17. Set label ~change::complete: /label ~change::complete
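Steps 11 and 15 both hinge on which hosts ci-db-replica resolves to. The step 15 check can be scripted as a sketch; the patroni-ci host-naming pattern below is an assumption and should be adjusted to the real gprd hostnames:

```shell
# check_only_ci: read `dig ci-db-replica.service.consul +short SRV` output on
# stdin and succeed only if every target host (field 4 of the SRV record)
# matches the patroni-ci naming pattern (the pattern is an assumption).
check_only_ci() {
  ! awk '{print $4}' | grep -qv '^patroni-ci'
}

# Usage:
#   dig ci-db-replica.service.consul +short SRV | check_only_ci \
#     && echo 'OK: only patroni-ci hosts advertise ci-db-replica'
```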

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 10 mins

  • Rollback the MRs merged above
  • To speed up the rollback, run Chef on the affected Patroni CI and Patroni Main hosts
  • Once Chef has run on the Patroni CI hosts, explicitly delete the /etc/consul/conf.d/dormant-ci-db-replica*.json files, as Chef does not clean these up properly
  • Set label ~change::aborted: /label ~change::aborted
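The rollback steps above can be collapsed into a dry-run helper that prints the knife commands in order, reusing the role names and concurrency values from the forward steps in this issue. This is a sketch; review the output before running anything:

```shell
# rollback_cmds: print the rollback command sequence in order. Inspect the
# output, then run each line by hand (or pipe to bash) once the revert MRs
# have merged.
rollback_cmds() {
  cat <<'EOF'
knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo chef-client'
knife ssh -C 5 'roles:gprd-base-db-patroni-v12' 'sudo chef-client'
knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo rm -f /etc/consul/conf.d/dormant-ci-db-replica*.json'
knife ssh -C 10 'roles:gprd-base-db-patroni-ci' 'sudo consul reload'
EOF
}
```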

Monitoring

Key metrics to observe

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are ~severity::1 or ~severity::2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.