[GPRD] Increase pgbouncer pool sizes to reduce saturation and connection wait times
# Production Change
### Change Summary
Equivalent `GSTG` change at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7378
This change increases the pgbouncer pool sizes, as discussed at https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/32
During the [CI Decomposition phase 4](https://gitlab.com/groups/gitlab-org/-/epics/6160#phase-4-separate-write-connections-for-ci-and-main-still-going-to-the-same-primary-host) we split both the sync and async pools between the `main` and `ci` pgbouncers to avoid saturating the `patroni-main` writer node. While the pools were split we faced [some incidents due to sidekiq pool saturation](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6880), during which we increased the pools to reduce the application impact, but we were still [limited by the writer node's resource capacity](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15628#note_1000164201), so we kept tracking the saturation risk for both the sync and async pools at https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/32
Now that [Phase 7: Promotion of the CI database](https://gitlab.com/groups/gitlab-org/-/epics/7791) is finished, we have two writer nodes, one for `patroni-main` and one for `patroni-ci`. We therefore plan to roll this CR out gradually, increasing the pgbouncer pool sizes to reduce pgbouncer saturation while still leaving headroom on the Patroni writer nodes for unexpected spikes.
**The [Current pool sizes](https://thanos-query.ops.gitlab.net/graph?g0.expr=min(pgbouncer_databases_pool_size%7Bname%3D~%22gitlabhq_production.*%22%2C%20env%3D%22gprd%22%2Cstage%3D%22main%22%2Ctype%3D~%22pgbouncer.*%22%7D)%20by%20(name%2Ctype)&g0.tab=1&g0.stacked=0&g0.range_input=2d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D) are:**
- {name="gitlabhq_production", type="pgbouncer"} = 27
- {name="gitlabhq_production", type="pgbouncer-ci"} = 27
- {name="gitlabhq_production_sidekiq", type="pgbouncer"} = 30
- {name="gitlabhq_production_sidekiq", type="pgbouncer-ci"} = 22
**The target of the pool resizing is to find values that satisfy:**
- Connection Saturation per Pool < 80% - Metric at: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7381#evaluate-next-pool-increase
- Total Connection Wait Time < 10 seconds - Metric at: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7381#evaluate-next-pool-increase
**Different satisfactory values can be agreed on at every iteration/round, as per https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/32#note_1018539403**
### Change Details
1. **Services Impacted** - ~"Service::Pgbouncer" ~"Service::API" ~"Service::Web" ~"Service::Postgres" ~Database
1. **Change Technician** - @rhenchen.gitlab
1. **Change Reviewer** - @ayufan @DylanGriffith @Finotto
1. **Time tracking** - Multiple Weeks - following agreement at https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/32
1. **Downtime Component** - None
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change
*Estimated Time to Complete (mins)* - 10 minutes
1. [x] Confirm that the `gstg` CR was executed successfully - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7378
1. [x] Check that all MRs are rebased
1. [x] Confirm which host is the Patroni Writer for both clusters
- Main Cluster Primary Host:
- CI Cluster Primary Host:
1. [x] \[optional\] clone https://gitlab.com/rhenchen.gitlab/rhenchen/-/tree/main/scripts and get familiar with the `ssh_cluster_regex.sh` script
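For readers who skip cloning the repo, the helper's behavior can be sketched as follows. This is a minimal, hypothetical sketch, not the actual script from the linked repo; in particular, using `knife node list` as the host inventory source is an assumption:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of ssh_cluster_regex.sh: run a command over SSH on every
# host whose name matches a regex. The real script in the linked repo may differ.
set -euo pipefail

# filter_hosts REGEX -- reads hostnames on stdin, prints only the matching ones
filter_hosts() { grep -E "$1"; }

run_on_cluster() {
  local regex="$1" cmd="$2"
  # Assumption: an inventory command such as `knife node list` prints one
  # hostname per line; SSH_CLUSTER_INVENTORY overrides it for dry runs.
  local inventory="${SSH_CLUSTER_INVENTORY:-$(knife node list)}"
  while IFS= read -r host; do
    echo "=== ${host} ==="
    ssh -o BatchMode=yes "$host" "$cmd"
  done < <(filter_hosts "$regex" <<<"$inventory")
}

# Invoked as, e.g.:
#   ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-ci|pgbouncer-sidekiq).*gprd" "sudo chef-client"
if [[ $# -ge 2 ]]; then
  run_on_cluster "$1" "$2"
fi
```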
### Change Steps - steps to take to execute the change
*Estimated Time to Complete (mins)* - 15 minutes (each round)
1. [x] **1st Round (11/07/2022)**
1. [x] During quiet working hours (\~00:00 UTC)
1. [x] Get green light from `@sre-oncall` and `@release-managers` at `#production` Slack channel
1. [x] Set label ~"change::in-progress" `/label ~change::in-progress`
1. [x] Increase pool size limit of Sidekiq PGBouncer pools by 50% -> `Main = 45` and `CI = 33`, as decided at https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/32#note_1017357561
- MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2101
1. [x] Re-run chef in all PGBouncer nodes
- Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-ci|pgbouncer-sidekiq).*gprd" "sudo chef-client"`
1. [x] Confirm the pool sizing in all PGBouncer nodes
- Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-ci|pgbouncer-sidekiq).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
1. [x] Set label change complete ~"change::complete" `/label ~change::complete `
1. [x] Monitor the [key metrics](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7381#monitoring) for 1 week.
1. [x] Discuss the need for a further increase, a decrease, or a rollback to the previous stage.
1. [x] **2nd Round (27/07/2022)**
1. [x] During quiet working hours (after \~22:00 UTC)
1. [x] Get green light from `@sre-oncall` and `@release-managers` at `#production` Slack channel
1. [x] #7519+
1. [x] Increase pool size limit of SYNC (non-sidekiq) pools by 50% -> `Main = 40` and `CI = 40`, as decided at {+ URL +}
- MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2108
1. [x] Re-run chef in all PGBouncer nodes
- Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-ci|pgbouncer-sidekiq).*gprd" "sudo chef-client"`
1. [x] Confirm the pool sizing in all PGBouncer nodes
- Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-ci|pgbouncer-sidekiq).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
1. [x] Set label ~"change::complete" `/label ~change::complete`
1. [x] Monitor the [key metrics](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7381#monitoring) for 1 week.
1. [x] Discuss the need for a further increase, a decrease, or a rollback to the previous stage.
1. [x] **3rd Round (not necessary)**
## Rollback
### Rollback steps - steps to be taken in the event of a need to rollback this change
*Estimated Time to Complete (mins)* - 15 minutes
1. [ ] Revert MR of the LAST applied round
- MR: 3rd round - {+ TODO +}
- MR: 2nd round - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2108
- MR: 1st round - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2101
1. [ ] Re-run chef in all PGBouncer nodes
- Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-ci|pgbouncer-sidekiq).*gprd" "sudo chef-client"`
1. [ ] Confirm the pool sizing in all PGBouncer nodes
- Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-ci|pgbouncer-sidekiq).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
1. [ ] Set label ~"change::aborted" `/label ~change::aborted`
## Monitoring
### Key metrics to observe
<!--
* Describe which dashboards and which specific metrics we should be monitoring related to this change using the format below.
-->
#### Rollback Thresholds
- Metric: Leader nodes CPU Load (processes per core)
- Location: [node_load1](https://thanos-query.ops.gitlab.net/graph?g0.expr=avg_over_time(node_load1%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%7D%5B10m%5D)%20%2F%20instance%3Anode_cpus%3Acount%20and%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D)
- What changes to this metric should prompt a rollback: `CPU Load Avg > 0.7` (per core) for 15 minutes or more;
- Metric: Leader nodes CPU Usage (% of all CPUs)
- Location: [node_cpu_utilization](https://thanos-query.ops.gitlab.net/graph?g0.expr=avg_over_time(instance%3Anode_cpu_utilization%3Aratio%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%7D%5B10m%5D)%20and%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D)
- What changes to this metric should prompt a rollback: avg `CPU utilization > 70%` for 15 minutes or more;
- Metric: Leader nodes Memory Thrashing (Swap in/out)
- Location: [node_vmstat_pswpin](https://thanos-query.ops.gitlab.net/graph?g0.expr=(rate(node_vmstat_pswpin%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%7D%5B10m%5D)%20*%204096)%20%0Aand%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D) , [node_vmstat_pswpout](https://thanos-query.ops.gitlab.net/graph?g0.expr=rate(node_vmstat_pswpout%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%7D%5B10m%5D)%20*%204096%0Aand%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D)
- What changes to this metric should prompt a rollback: Spikes of `Swapping activity > 0` for 5 minutes or more;
- Metric: Leader nodes I/O wait
- Location: [node_disk_read_time_seconds_total](https://thanos-query.ops.gitlab.net/graph?g0.expr=rate(node_disk_read_time_seconds_total%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%2C%20device%3D%22sdb%22%7D%5B1m%5D)%20%2F%20rate(node_disk_reads_completed_total%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%2C%20device!~%22dm.*%22%7D%5B1m%5D)%20%3E%200%0Aand%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D) , [node_disk_write_time_seconds_total](https://thanos-query.ops.gitlab.net/graph?g0.expr=rate(node_disk_write_time_seconds_total%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%2C%20device%3D%22sdb%22%7D%5B1m%5D)%20%2F%20rate(node_disk_writes_completed_total%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%2C%20device!~%22dm.*%22%7D%5B1m%5D)%20%3E%200%0Aand%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D)
- What changes to this metric should prompt a rollback: avg `I/O wait > 10ms (or 0.01s)` for 2 minutes or more, _but only if caused by an intense I/O activity_;
- Metric: Leader nodes I/O Throughput in MB/s
- Location: [/dev/sdb node_disk_read_bytes_total](https://thanos-query.ops.gitlab.net/graph?g0.expr=rate(node_disk_read_bytes_total%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%2C%20device%3D%22sdb%22%7D%5B1m%5D)%20and%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D), [/dev/sdb node_disk_written_bytes_total](https://thanos-query.ops.gitlab.net/graph?g0.expr=rate(node_disk_written_bytes_total%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%2C%20device%3D%22sdb%22%7D%5B1m%5D)%20and%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D)
- What changes to this metric should prompt a rollback: `I/O Throughput > 840 MB/s` (70% of the 1,200 MB/s limit*) for 15 minutes or more;
- Metric: Leader nodes IOPS
- Location: [/dev/sdb node_disk_reads_completed_total](https://thanos-query.ops.gitlab.net/graph?g0.expr=rate(node_disk_reads_completed_total%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%2C%20device%3D%22sdb%22%7D%5B1m%5D)%20and%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D) , [/dev/sdb node_disk_writes_completed_total](https://thanos-query.ops.gitlab.net/graph?g0.expr=rate(node_disk_writes_completed_total%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%2C%20device%3D%22sdb%22%7D%5B1m%5D)%20and%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D)
- What changes to this metric should prompt a rollback: I/O operations per second `IOPS > 70,000` (70% of the 100,000 IOPS limit*) for 15 minutes or more;
- Metric: Writer nodes Network throughput
- Location: [node_network_receive_bytes_total](https://thanos-query.ops.gitlab.net/graph?g0.expr=rate(node_network_receive_bytes_total%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%2C%20device!%3D%22lo%22%7D%5B1m%5D)%20and%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D) , [node_network_transmit_bytes_total](https://thanos-query.ops.gitlab.net/graph?g0.expr=rate(node_network_transmit_bytes_total%7Benv%3D%22gprd%22%2Ctype%3D~%22patroni%7Cpatroni-ci%22%2C%20device!%3D%22lo%22%7D%5B1m%5D)%20and%20on%20(fqdn)%20pg_replication_is_replica%3D%3D0&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D)
- What changes to this metric should prompt a rollback: Sustained `Network Throughput > 22.4 Gbps (2.8 GB/s)`, 70% of the [VM limit](https://cloud.google.com/compute/docs/general-purpose-machines#n1_machines) of `32 Gbps (4 GB/s)`*, for 15 minutes or more;
_* Network and Storage I/O performance limits in `gprd` are based on `SSD (performance) persistent disk` of 28 TBs and `n1-highmem-96` VM with 96 vCPUs, where the I/O bottleneck is the 96vCPU [N1 machine type limits for pd-performance](https://cloud.google.com/compute/docs/disks/performance#machine-type-disk-limits) and not the [block device limits](https://cloud.google.com/compute/docs/disks/performance#type_comparison)_
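For reference, the first rollback-threshold query above (leader-node CPU load per core), transcribed from the linked Thanos expression:

```promql
# Average 10m load per core, restricted to the current primaries
# (pg_replication_is_replica == 0 selects the writer of each cluster).
avg_over_time(node_load1{env="gprd", type=~"patroni|patroni-ci"}[10m])
  / instance:node_cpus:count
and on (fqdn) pg_replication_is_replica == 0
```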
#### Evaluate next pool increase
- Metric: Connection Saturation per Pool
- Location: [Main cluster](https://thanos.gitlab.net/graph?g0.expr=clamp_min(clamp_max(sum%20by%20(database%2C%20env%2C%20environment%2C%20shard%2C%20stage%2C%20type)%20(%0A%20%20pgbouncer_pools_server_active_connections%7Btype%3D%22pgbouncer%22%2C%20environment%3D%22gprd%22%2C%20user%3D%22gitlab%22%2C%20database!%3D%22pgbouncer%22%7D%20%2B%0A%20%20pgbouncer_pools_server_testing_connections%7Btype%3D%22pgbouncer%22%2C%20environment%3D%22gprd%22%2C%20user%3D%22gitlab%22%2C%20database!%3D%22pgbouncer%22%7D%20%2B%0A%20%20pgbouncer_pools_server_used_connections%7Btype%3D%22pgbouncer%22%2C%20environment%3D%22gprd%22%2C%20user%3D%22gitlab%22%2C%20database!%3D%22pgbouncer%22%7D%20%2B%0A%20%20pgbouncer_pools_server_login_connections%7Btype%3D%22pgbouncer%22%2C%20environment%3D%22gprd%22%2C%20user%3D%22gitlab%22%2C%20database!%3D%22pgbouncer%22%7D%0A)%0A%2F%0Asum%20by%20(database%2C%20env%2C%20environment%2C%20shard%2C%20stage%2C%20type)%20(%0A%20%20label_replace(%0A%20%20%20%20pgbouncer_databases_pool_size%7Btype%3D%22pgbouncer%22%2C%20environment%3D%22gprd%22%7D%2C%0A%20%20%20%20%22database%22%2C%20%22gitlabhq_production_sidekiq%22%2C%20%22name%22%2C%20%22gitlabhq_production_sidekiq%22%0A%20%20)%0A)%0A%2C1)%2C0)&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D) , [CI 
cluster](https://thanos.gitlab.net/graph?g0.expr=clamp_min(clamp_max(sum%20by%20(database%2C%20env%2C%20environment%2C%20shard%2C%20stage%2C%20type)%20(%0A%20%20pgbouncer_pools_server_active_connections%7Btype%3D%22pgbouncer-ci%22%2C%20environment%3D%22gprd%22%2C%20user%3D%22gitlab%22%2C%20database!%3D%22pgbouncer%22%7D%20%2B%0A%20%20pgbouncer_pools_server_testing_connections%7Btype%3D%22pgbouncer-ci%22%2C%20environment%3D%22gprd%22%2C%20user%3D%22gitlab%22%2C%20database!%3D%22pgbouncer%22%7D%20%2B%0A%20%20pgbouncer_pools_server_used_connections%7Btype%3D%22pgbouncer-ci%22%2C%20environment%3D%22gprd%22%2C%20user%3D%22gitlab%22%2C%20database!%3D%22pgbouncer%22%7D%20%2B%0A%20%20pgbouncer_pools_server_login_connections%7Btype%3D%22pgbouncer-ci%22%2C%20environment%3D%22gprd%22%2C%20user%3D%22gitlab%22%2C%20database!%3D%22pgbouncer%22%7D%0A)%0A%2F%0Asum%20by%20(database%2C%20env%2C%20environment%2C%20shard%2C%20stage%2C%20type)%20(%0A%20%20label_replace(%0A%20%20%20%20pgbouncer_databases_pool_size%7Btype%3D%22pgbouncer-ci%22%2C%20environment%3D%22gprd%22%7D%2C%0A%20%20%20%20%22database%22%2C%20%22gitlabhq_production_sidekiq%22%2C%20%22name%22%2C%20%22gitlabhq_production_sidekiq%22%0A%20%20)%0A)%0A%2C1)%2C0)&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D)
- What changes to this metric should prompt a further increase of the pool size: spikes of `pool saturation > 80% (0.8)` for more than 10 minutes;
- Metric: Total Connection Wait Time
- Location: [Main cluster](https://thanos.gitlab.net/graph?g0.expr=sum%20by%20(database%2C%20environment%2C%20type)%20(rate(pgbouncer_stats_client_wait_seconds_total%7Btype%3D%22pgbouncer%22%2C%20environment%3D%22gprd%22%2C%20database!%3D%22pgbouncer%22%7D%5B1m%5D)%20%2F%20on()%20group_left()%20(vector((time()%20%3C%20bool%201588233600)%20*%201000000)%20%3D%3D%201000000%20or%20vector(1)))&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D) , [CI cluster](https://thanos.gitlab.net/graph?g0.expr=sum%20by%20(database%2C%20environment%2C%20type)%20(rate(pgbouncer_stats_client_wait_seconds_total%7Btype%3D%22pgbouncer-ci%22%2C%20environment%3D%22gprd%22%2C%20database!%3D%22pgbouncer%22%7D%5B1m%5D)%20%2F%20on()%20group_left()%20(vector((time()%20%3C%20bool%201588233600)%20*%201000000)%20%3D%3D%201000000%20or%20vector(1)))&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D)
- What changes to this metric should prompt a further increase of the pool size: spikes of `connection wait time > 10 seconds` at any moment;
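For reference, the two queries encoded in the links above, written out readably (transcribed from the linked Thanos expressions; the main cluster is shown, swap `type="pgbouncer-ci"` for the CI cluster):

```promql
# Connection saturation per pool: in-use server connections / configured
# pool size, clamped to [0, 1].
clamp_min(clamp_max(
  sum by (database, env, environment, shard, stage, type) (
      pgbouncer_pools_server_active_connections{type="pgbouncer", environment="gprd", user="gitlab", database!="pgbouncer"}
    + pgbouncer_pools_server_testing_connections{type="pgbouncer", environment="gprd", user="gitlab", database!="pgbouncer"}
    + pgbouncer_pools_server_used_connections{type="pgbouncer", environment="gprd", user="gitlab", database!="pgbouncer"}
    + pgbouncer_pools_server_login_connections{type="pgbouncer", environment="gprd", user="gitlab", database!="pgbouncer"}
  )
  /
  sum by (database, env, environment, shard, stage, type) (
    label_replace(
      pgbouncer_databases_pool_size{type="pgbouncer", environment="gprd"},
      "database", "gitlabhq_production_sidekiq", "name", "gitlabhq_production_sidekiq"
    )
  )
, 1), 0)

# Total connection wait time per database (the linked query additionally
# rescales older samples that were recorded in microseconds):
sum by (database, environment, type) (
  rate(pgbouncer_stats_client_wait_seconds_total{type="pgbouncer", environment="gprd", database!="pgbouncer"}[1m])
)
```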
## Change Reviewer checklist
<!--
To be filled out by the reviewer.
-->
~C4 ~C3 ~C2 ~C1:
- [x] Check if the following applies:
- The **scheduled day and time** of execution of the change is appropriate.
- The [change plan](#detailed-steps-for-the-change) is technically accurate.
- The change plan includes **estimated timing values** based on previous testing.
- The change plan includes a viable [rollback plan](#rollback).
- The specified [metrics/monitoring dashboards](#key-metrics-to-observe) provide sufficient visibility for the change.
~C2 ~C1:
- [x] Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary
## Change Technician checklist
<!--
To find out who is on-call, in #production channel run: /chatops run oncall production.
-->
- [x] Check if all items below are complete:
- The [change plan](#detailed-steps-for-the-change) is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- For ~C1 and ~C2 change issues, the change event is added to the [GitLab Production](https://calendar.google.com/calendar/embed?src=gitlab.com_si2ach70eb1j65cnu040m3alq0%40group.calendar.google.com) calendar.
- For ~C1 and ~C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- There are currently no [active incidents](https://gitlab.com/gitlab-com/gl-infra/production/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Incident%3A%3AActive) that are ~severity::1 or ~severity::2
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.