2023-04-17: [GSTG][main db] Provision a new PG14 cluster
<!-- Please review https://about.gitlab.com/handbook/engineering/infrastructure/change-management/ for the most recent information on our change plans and execution policies. -->

# Production Change

### Change Summary

Provisions a new cluster via TF. This was previously attempted in https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5439, which was reverted in https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5453 because it did not follow the Change Management process.

Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16419.

### Change Details

1. **Services Impacted** - {+ List services +}
1. **Change Technician** - `@anganga`
1. **Change Reviewer** - @ahmadsherif
1. **Time tracking** - 30 minutes
1. **Downtime Component** - none

## Detailed steps for the change

### Change Steps - steps to take to execute the change

*Estimated Time to Complete (mins)* - {+Estimated Time to Complete in Minutes+}

- [ ] Set label ~"change::in-progress" `/label ~change::in-progress`
- [ ] Merge the following MRs
  - [ ] https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3196
  - [ ] https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5469
- [ ] Set label ~"change::complete" `/label ~change::complete`

## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

*Estimated Time to Complete (mins)* - 10 minutes

- [ ] Revert https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5469
- [ ] Set label ~"change::aborted" `/label ~change::aborted`

## Monitoring

### Key metrics to observe

<!-- * Describe which dashboards and which specific metrics we should be monitoring related to this change using the format below.
-->

- Metric: patroni Service Error Ratio and pgbouncer SLI Error Ratio
  - Location: https://dashboards.gitlab.net/goto/RKW3kRPVz?orgId=1
  - What changes to this metric should prompt a rollback: elevated error rates for pgbouncer

## Change Reviewer checklist

<!-- To be filled out by the reviewer. -->

~C4 ~C3 ~C2 ~C1:
- [ ] Check if the following applies:
  - The **scheduled day and time** of execution of the change is appropriate.
  - The [change plan](#detailed-steps-for-the-change) is technically accurate.
  - The change plan includes **estimated timing values** based on previous testing.
  - The change plan includes a viable [rollback plan](#rollback).
  - The specified [metrics/monitoring dashboards](#key-metrics-to-observe) provide sufficient visibility for the change.

~C2 ~C1:
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary

## Change Technician checklist

<!-- To find out who is on-call, in #production channel run: /chatops run oncall production. -->

- [ ] Check if all items below are complete:
  - The [change plan](#detailed-steps-for-the-change) is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the [Production Change Lock periods](https://about.gitlab.com/handbook/engineering/infrastructure/change-management/#production-change-lock-pcl).
  - For ~C1 and ~C2 change issues, the change event is added to the [GitLab Production](https://calendar.google.com/calendar/embed?src=gitlab.com_si2ach70eb1j65cnu040m3alq0%40group.calendar.google.com) calendar.
  - For ~C1 and ~C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For ~C1 and ~C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For ~C1 and ~C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no [active incidents](https://gitlab.com/gitlab-com/gl-infra/production/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Incident%3A%3AActive) that are ~severity::1 or ~severity::2
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.