[GPRD] [2023-05-27 03:00 UTC] CR - Hardware upgrade of Patroni Primary nodes on CI and Main databases (Switchover)

Production Change

Proposed time: 2023-05-27 (Saturday) 03:00 AM UTC == 2023-05-26 (Friday) 08:00 PM PDT

Change Summary

As part of the rollout plan (see: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/18934#steps-to-perform-in-gprd) we have already replaced all Replicas, moving from the old n1-highmem-96 VMs to the new-generation n2-highmem-128 VMs in the patroni-main cluster and n2-highmem-96 VMs in the patroni-ci cluster.

The last nodes still running on the old N1 hardware are the current Primary/Writer Patroni nodes, so we need to perform a Switchover operation, which consists of a quick change of roles from the current N1 Primary VM to a new N2 VM. This operation will block SQL statements and will therefore cause 50x errors on GitLab.com while the N2 nodes are promoted and the master endpoints are reconfigured. We expect this operation to take between 30 seconds and 5 minutes.

This operation carries a very low risk of failure and there is no risk of data loss, as replication between the nodes will be synchronised before the new node is promoted to Leader, and we can switch the Leader role back to the previous VM at any point.
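
As a pre-check, replication for the Leader candidates can be confirmed on the current Primaries before the switchover; a minimal sketch (the replay_lag of the 101 candidate nodes is expected to be close to zero):

    ssh patroni-main-2004-04-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"SELECT application_name, state, sync_state, replay_lag FROM pg_stat_replication;\""
    ssh patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"SELECT application_name, state, sync_state, replay_lag FROM pg_stat_replication;\""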

We have already performed this process in GSTG during CR #8757 (closed)

In this same CR we will increase PostgreSQL max_connections to 670 in patroni-main. This parameter is set at the DCS cluster level and requires the instances to restart to take effect on each node, so we are taking the opportunity to avoid a further maintenance. We need to increase max_connections because we plan to reduce the number of nodes for better cost efficiency (see &851 (comment 1286123341)), therefore more workload will be routed to each node. The value of 670 is approximately 1/3 higher than the current value of 500, mirroring the 1/3 increase in CPU count of the patroni-main nodes (from 96 to 128 vCPUs per node).
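
The arithmetic behind the new value, as a quick sanity check (not a change step):

    ## 500 connections scaled by the 96 -> 128 vCPU increase, then rounded up to 670 for headroom
    echo $((500 * 128 / 96))   # prints 666 (integer division)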

Why should we implement this change as soon as possible?

Since February 2023, GitLab.com's datastore layer has been suffering from pg_primary_cpu saturation spikes in our patroni-main database, see https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/892. As @tkuah mentioned, "Even though the tamland now shows no forecasted violation, today we had a series of peaks close to 80% CPU. Opened gitlab-org/gitlab#407823 (closed) for this" (https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/892#note_1357859469). With the hardware upgrade we will be increasing the number and speed of CPU cores on our patroni-main Primary node, which should considerably reduce the risk of CPU saturation. Therefore, a long wait to implement this change goes against our customers' interests.

CSMs/TAMs message to customers

This weekend’s database hardware upgrade had to be rescheduled to next Saturday, 2023-05-27, from 03:00 to 04:00 UTC. Unfortunately, there was a long database migration running to fix an incident that we couldn’t risk interrupting. We apologise for any inconvenience. Next weekend, users may experience temporary 50X errors for a brief period during the database maintenance window. As previously communicated, the hardware upgrade is part of a Database Scalability Strategy we are implementing to improve the overall database availability and performance of GitLab.com.

More details can be found at #10694 (closed)


FAQ

Does this maintenance affect GitLab Dedicated customers?

No. It will not impact GitLab Dedicated single-tenancy environments. This maintenance targets only GitLab.com shared infrastructure.


Change Details

  1. Services Impacted - ~"Service::Patroni" ~"Service::PatroniCI"
  2. Change Technician - @rhenchen.gitlab
  3. Change Reviewer - @alexander-sosna or @bshah11
  4. Time tracking - 1 hour and 30 minutes
  5. Downtime Component - yes

Detailed steps for the change

Prep Tasks

T minus 2 weeks (2023-05-05 02:00 UTC)

  1. CMOC : Ensure that the maintenance window is scheduled on status.io.
  2. CMOC : Post update from Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
    • Message: In 2 weeks, as part of a planned maintenance window on 2023-05-20 from 03:00 to 04:00 UTC, we will perform a hardware upgrade for the GitLab.com datastore. Users may experience temporary 50X errors for a brief period during this window. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
  3. Ask our CSMs in our #customer-success Slack channel about their preferences on how to communicate this change to our main customers:
    1. Ping CSM managers using the @cs-tam-mgrs alias to request that they notify the CSMs for our top SaaS customers.
  4. Share information and a link to the Issue in #whats-happening-at-gitlab slack channel
  5. Create communication issue (@kwanyangu)

T minus 1 week (2023-05-12 02:00 UTC)

  1. CMOC : Communicate 1 week to maintenance
    • Message: Next week, as part of a planned maintenance window on 2023-05-20 from 03:00 to 04:00 UTC, we will perform a hardware upgrade for the GitLab.com datastore. Users may experience temporary 50X errors for a brief period during this window. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694

T minus 3 days (2023-05-17 02:00 UTC)

  1. CMOC : Communicate 3 days to maintenance patroni
    • Message: We will be conducting a database maintenance activity this Saturday, 2023-05-20, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
  2. DBRE: Create a merge request in GPRD TF repo to Sync to_be_destroyed into nodes run_list in TF

T minus 1 day (2023-05-19 02:00 UTC)

  1. CMOC : Communicate 1 day to maintenance patroni
    • Message: We will be conducting a database maintenance activity tomorrow, 2023-05-20, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694

T minus 2 hours (2023-05-20 01:00 UTC)

  1. CMOC : Communicate 2 hours to maintenance patroni
    • Message: We will be conducting a database maintenance activity in 2 hours, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694

T minus 1 hour (2023-05-20 02:00 UTC)

  1. CMOC : Communicate 1 hour to maintenance patroni
    • Message: We will be conducting a database maintenance activity in an hour, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694

Abort Maintenance due to incident recovery (2023-05-20 03:00 UTC)

  1. CMOC : Communicate maintenance aborted due to incident recovery as mentioned at #10694 (comment 1398343384)
    • Message: The maintenance has been rescheduled to next Saturday, 2023-05-27, from 03:00 to 04:00 UTC. There is currently a long database migration running to fix an incident (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/14468) that we can't risk interrupting. We apologise for any inconvenience.

New communication plan

T minus 3 days (2023-05-24 02:00 UTC)

  1. CMOC : Communicate 3 days to maintenance patroni
    • Message: We will be conducting a database maintenance activity this Saturday, 2023-05-27, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
  2. DBRE: Create a merge request in GPRD TF repo to Sync to_be_destroyed into nodes run_list in TF

T minus 1 day (2023-05-26 02:00 UTC)

  1. CMOC : Communicate 1 day to maintenance patroni
    • Message: We will be conducting a database maintenance activity tomorrow, 2023-05-27, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694

T minus 2 hours (2023-05-27 01:00 UTC)

  1. CMOC : Communicate 2 hours to maintenance patroni
    • Message: We will be conducting a database maintenance activity in 2 hours, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694

T minus 1 hour (2023-05-27 02:00 UTC)

  1. CMOC : Communicate 1 hour to maintenance patroni
    • Message: We will be conducting a database maintenance activity in an hour, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694

T minus 15 minutes (2023-05-27 02:45 UTC)

  • Confirm the current Primary nodes of the main, CI, and registry clusters
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
    ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
    ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  • Open Patroni logs in the old and new Primary DBs
    • ssh patroni-main-2004-04-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
    • ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
    • ssh patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
    • ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
  • Increase PostgreSQL max_connections in patroni-main from the current value of 500 to ~670 (an increase of ~1/3), as sketched below
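    One possible way to apply this at the DCS level is patronictl's edit-config (a hedged sketch: it assumes the gitlab-patronictl wrapper passes the standard edit-config flags through to patronictl; the new value only becomes effective once each node restarts, which happens during the change steps below):

    ## Set max_connections=670 in the cluster-wide DCS configuration of patroni-main
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl edit-config gprd-patroni-main-pg12-2004 --pg max_connections=670 --force"
    ## Confirm the new value is present in the DCS config
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl show-config gprd-patroni-main-pg12-2004"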

Change Steps - steps to take to execute the change

[2023-05-27 03:00 UTC] Patroni Primary Switchover

Estimated Time to Complete (mins) - 60 minutes

  • Set label ~change::in-progress: /label ~change::in-progress

  • Silence alerts: execute /chatops run pager pause in #production

  • CMOC : Communicate the start of the Maintenance patroni

    • Message: The database maintenance is starting now. The database Primary node switchover should happen at any moment in the next hour and during the switchover users might see 50X errors for a very brief span of time. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
  • Restart the Patroni Leader candidates to apply the new settings and make them run on the latest deployed PG minor version (this might cause errors in Rails, but it should be transparent to customers thanks to retries on healthy replicas)

    • Main

      ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl restart --force gprd-patroni-main-pg12-2004 patroni-main-2004-101-db-gprd.c.gitlab-production.internal"
      ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
    • CI

      ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl restart --force gprd-patroni-ci-pg12-2004 patroni-ci-2004-101-db-gprd.c.gitlab-production.internal"
      ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
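    • Optionally confirm that the new max_connections value is active on the restarted patroni-main candidate (a minimal check, assuming the max_connections change from the prep tasks was applied)

      ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"show max_connections;\""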
  • Restart the Registry cluster to make it run on the latest deployed PG minor version (quick downtime for the registry database)

    ## Restart Registry cluster
    ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl restart --force gprd-pg12-patroni-registry "
    
    ## Wait until Registry cluster is running again
    ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  • Check that the server_version of the running postmasters on the future Primary nodes is now 12.14 or later:

    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"show server_version;\""
    ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"show server_version;\""
    ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"show server_version;\""
  • Shut down Primary/Master pgbouncers (including sidekiq) for CI and MAIN (downtime start)

    ## Disable chef-client (to avoid Chef auto starting pgbouncer)
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci"  "sudo chef-client-disable \"CR https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694\""
    
    ## Shutdown pgbouncer processes
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo /usr/local/bin/pgb-console -c \"SHUTDOWN;\""
    
    ## Check if pgbouncer processes were killed and are not running
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
    ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
  • Wait for patroni-main-2004-101-db-gprd and patroni-ci-2004-101-db-gprd to get back in sync

    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
    ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  • Switchover both CI and Main primary nodes

    • Connect into patroni-ci-2004-101-db-gprd.c.gitlab-production.internal and run:

      knife ssh "roles:gprd-base-db-patroni-ci-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
      sudo gitlab-patronictl switchover --master patroni-ci-2004-05-db-gprd.c.gitlab-production.internal --candidate patroni-ci-2004-101-db-gprd.c.gitlab-production.internal
    • Connect into patroni-main-2004-101-db-gprd.c.gitlab-production.internal and run:

      knife ssh "roles:gprd-base-db-patroni-main-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
      sudo gitlab-patronictl switchover --master patroni-main-2004-04-db-gprd.c.gitlab-production.internal --candidate patroni-main-2004-101-db-gprd.c.gitlab-production.internal
    • Check new Cluster status

      ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
      ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
    • Check that the Replicas are in SYNC with the new Primary (the timeline, TL, should have been updated)

    • Check in postgresql-replication-overview whether the replication slots were created on the new Primary node, or run sudo gitlab-psql -c "select * from pg_replication_slots" (see also the sketch below)
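
    • A minimal sketch of both checks on the new Primary nodes (expected: a higher timeline_id than before the switchover, and one active physical slot per replica)

      ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"SELECT timeline_id FROM pg_control_checkpoint();\""
      ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"SELECT slot_name, active FROM pg_replication_slots;\""
      ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"SELECT timeline_id FROM pg_control_checkpoint();\""
      ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"SELECT slot_name, active FROM pg_replication_slots;\""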

    • Validate master endpoints update in DNS

      ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni.service.consul."
      knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer" "dig @localhost -p 8600 +short master.patroni.service.consul. SRV"
      
      ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni-ci.service.consul."
      knife ssh "role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "dig @localhost -p 8600 +short master.patroni-ci.service.consul. SRV"
  • Start Primary/Master pgbouncer services (including sidekiq) for CI and MAIN (downtime finish)

    ## Re-enable chef-client (it was disabled above to prevent Chef from auto-starting pgbouncer during the switchover)
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci"  "sudo chef-client-enable"
    
    ## Start pgbouncer processes
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo systemctl start pgbouncer"
    
    ## Check if pgbouncer processes were started and are running
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
    ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
  • Validate the switch of Write/DML operations to the new Primary instances, e.g. with the check below
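
    A quick hedged check: pg_is_in_recovery() should now return f on the new N2 Primaries and t on the demoted N1 nodes

    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"SELECT pg_is_in_recovery();\""
    ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"SELECT pg_is_in_recovery();\""
    ssh patroni-main-2004-04-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"SELECT pg_is_in_recovery();\""
    ssh patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"SELECT pg_is_in_recovery();\""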

  • Resume alerts: execute /chatops run pager resume in #production

  • CMOC : Communicate the end of the Maintenance

    • Message: The database switchover is now complete, and we expect all SQL statements to be routed to the new nodes. The site is back up and we're continuing to verify that all systems are functioning correctly. Thank you for your patience.
  • Mark old Primary N1 nodes to_be_destroyed

    knife node run_list add patroni-main-2004-04-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
    knife node run_list add patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
    knife ssh "roles:gprd-base-db-patroni-2004" "sudo chef-client"
  • Merge MR to Sync to_be_destroyed into nodes run_list - MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5816

  • EOC : Confirm with the EOC that no more 50x errors are being logged

  • CMOC : Communicate the end of the Maintenance patroni

    • Click "Finish Maintenance" and send the following:
    • Message: GitLab.com's database layer maintenance is complete now, and we're fully back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
  • Set label ~change::scheduled: /label ~change::scheduled

  • Create MR to Destroy nodes patroni-main-2004-04 and patroni-ci-2004-05

  • Schedule node destruction after grace period of 1 week

[2023-06-05 02:00 UTC] Destroy old N1 VMs

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 10 minutes

  • CMOC : Communicate Switchback patroni

    • Message: Due to an issue during the maintenance, we have initiated a rollback of the Primary node hardware upgrade. We will send another update within the next 30 minutes.
  • Remove to_be_destroyed from old Primary N1 nodes run_list

    knife node run_list remove patroni-main-2004-04-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
    knife node run_list remove patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
    knife ssh "roles:gprd-base-db-patroni-2004" "sudo chef-client"
  • Shut down Primary/Master pgbouncers (including sidekiq) for CI and MAIN (downtime start)

    ## Disable chef-client (to avoid Chef auto starting pgbouncer)
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci"  "sudo chef-client-disable \"CR https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694\""
    
    ## Shutdown pgbouncer processes
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo /usr/local/bin/pgb-console -c \"SHUTDOWN;\""
    
    ## Check if pgbouncer processes were killed and are not running
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
    ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
  • Switchback Primary nodes

    • Connect into patroni-ci-2004-05-db-gprd.c.gitlab-production.internal and run:
    knife ssh "roles:gprd-base-db-patroni-ci-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
    sudo gitlab-patronictl switchover --master patroni-ci-2004-101-db-gprd.c.gitlab-production.internal --candidate patroni-ci-2004-05-db-gprd.c.gitlab-production.internal
    • Connect into patroni-main-2004-04-db-gprd.c.gitlab-production.internal and run:
    knife ssh "roles:gprd-base-db-patroni-main-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
    sudo gitlab-patronictl switchover --master patroni-main-2004-101-db-gprd.c.gitlab-production.internal --candidate patroni-main-2004-04-db-gprd.c.gitlab-production.internal
    • Check new Cluster status

      ssh patroni-main-2004-04-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
      ssh patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
    • Check that the Replicas are in SYNC with the new Primary (the timeline, TL, should have been updated)

    • Check in postgresql-replication-overview if the replication slots were created in the new Primary node or issue sudo gitlab-psql -c "select * from pg_replication_slots"

    • Validate master endpoints update in DNS

      ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni.service.consul."
      knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer" "dig @localhost -p 8600 +short master.patroni.service.consul. SRV"
      
      ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni-ci.service.consul."
      knife ssh "role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "dig @localhost -p 8600 +short master.patroni-ci.service.consul. SRV"
  • Start Primary/Master pgbouncer services (including sidekiq) for CI and MAIN (downtime finish)

    ## Re-enable chef-client (it was disabled above to prevent Chef from auto-starting pgbouncer during the switchback)
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci"  "sudo chef-client-enable"
    
    ## Start pgbouncer processes
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo systemctl start pgbouncer"
    
    ## Check if pgbouncer processes were started and are running
    knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
    ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
  • Resume alerts: execute /chatops run pager resume in #production

  • EOC : Confirm with the EOC that no more 50x errors are being logged

  • CMOC : Communicate Switchback Completed patroni

    • Message: GitLab.com's database rollback is complete now, and we're back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
  • Revert MR to Sync to_be_destroyed into nodes run_list: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5816

  • Set label ~change::aborted: /label ~change::aborted

Monitoring

Key metrics to observe - during the CR execution

Key metrics to observe after the CR execution, during grace period

Key resource usage metrics

  • Network and Storage I/O performance limits in GPRD are based on an SSD (performance) persistent disk of 2.5 TB and an n1-standard-8 VM with 8 vCPUs, where the I/O bottleneck is the 8-vCPU N1 machine-type limit for pd-performance rather than the block device limits

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • The change execution window respects the Production Change Lock periods.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
    • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are ~severity::1 or ~severity::2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.