2022-08-15: GSTG - Migrate TF module for Patroni clusters to 6.0.0

Production Change

Change Summary

For Patroni clusters, we use the generic-stor-with-group module (currently at version 5.4.0), which we maintain ourselves. The module relies on the values of node_count and multizone_node_count to create instances, static IP addresses, and all the other resources a Patroni cluster requires. These variables are referenced via count to create N copies of each resource (for example, instances). This works, but it is limiting: you cannot remove an instance from the middle of the cluster. To remove one, you would have to reduce node_count, which destroys the last instance in the cluster, and that may not be the one you want to remove.
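To illustrate the limitation, a count-based resource is addressed by position, so the only instance Terraform can remove by shrinking the count is the last one. This is a simplified sketch, not the module's actual source; the resource and naming scheme are assumptions:

```hcl
# Hypothetical sketch of the 5.4.0 count-based pattern (illustrative only).
# Instances are addressed by position:
#   google_compute_instance.instance[0], [1], [2], ...
resource "google_compute_instance" "instance" {
  count = var.node_count

  # Names derived from the index, e.g. patroni-01, patroni-02, ...
  name = format("patroni-%02d", count.index + 1)
  # ...
}

# Reducing var.node_count from 3 to 2 always destroys instance[2],
# the last one, even if the instance you wanted gone was instance[1].
```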

With the changes in https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/generic-stor-with-group/-/merge_requests/39, we have deprecated node_count and multizone_node_count in favor of defining a nodes map. With a map, we can remove an instance from the middle of the cluster and only that instance is affected. Additionally, we have deprecated the separate per_node_chef_run_list map in favor of a chef_run_list_extra attribute on whichever instance requires an extra run-list item.
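Under 6.0.0, a cluster definition might look along these lines. The exact map shape and the run-list value are assumptions based on the MR description; see the linked merge request for the real schema:

```hcl
# Before (5.4.0): positional, count-based
#   node_count             = 3
#   per_node_chef_run_list = { "2" = "role[extra-role]" }   # illustrative

# After (6.0.0): each instance is a named map entry, so any entry can be
# removed independently without disturbing the others.
nodes = {
  "patroni-01" = {}
  "patroni-02" = {
    # Replaces the separate per_node_chef_run_list map
    chef_run_list_extra = "role[extra-role]"
  }
  "patroni-03" = {}
}
```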

This should be a no-op CR: we are only making state changes, migrating the old resource addresses to the new ones. To make these state changes, we use moved blocks generated by a script.
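Terraform's moved blocks (a first-class language feature since Terraform 1.1) tell Terraform that a resource has a new address, so the apply rewrites state instead of destroying and recreating anything. The addresses below are illustrative, not taken from the actual generated files:

```hcl
# One moved block per resource instance, mapping the old count index to
# the new map key. Generated by the script rather than written by hand.
moved {
  from = module.patroni.google_compute_instance.instance[0]
  to   = module.patroni.google_compute_instance.instance["patroni-01"]
}

moved {
  from = module.patroni.google_compute_address.static_ip[0]
  to   = module.patroni.google_compute_address.static_ip["patroni-01"]
}
```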

Change Details

  1. Services Impacted - Service::Patroni, Service::PatroniCI
  2. Change Technician - @gsgl
  3. Change Reviewer - @rhenchen.gitlab
  4. Time tracking - unknown
  5. Downtime Component - none

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Set label: /label ~change::in-progress
  • Take a backup of the state - in config-mgmt/environments/gstg, do tf state pull > $(date +%Y%m%d).tfstate
  • In config-mgmt/environments/gstg, ensure your working copy of the repo is up-to-date, then check that there are no pending changes: tf plan
  • Update all the module definitions in config-mgmt/environments/gstg/main.tf that use generic-stor-with-group to point source at the relative path to the module (source = "../../../../../../ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/generic-stor-with-group") and comment out version. Also create the nodes map for each module and remove node_count, multizone_node_count, and per_node_chef_run_list.
  • Run tf init -upgrade in config-mgmt/environments/gstg
  • Create all the moved_* files for each module using this script
  • Run tf apply to perform all the moves in the state file. Ensure the plan shows nothing to add, change, or destroy; it should contain only moves.
  • Update the previously modified modules to restore the original source lines, re-enable version, and bump it to 6.0.0
  • Run tf init -upgrade
  • Remove all the moved_* files
  • Run a tf plan and ensure no changes.
  • File an MR with the version bumps to 6.0.0. Ensure Terraform report looks clean. Get it reviewed/merged.
  • Set label: /label ~change::complete
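The moved_* files above come from the generator script linked in the steps. As a rough sketch of what such a script does (the function, its parameters, and the address scheme are all assumptions here, not the real script), it pairs each old count-indexed address with its new map key and emits one moved block per instance:

```python
# Hypothetical sketch of a moved-block generator; the real script linked
# from this CR may differ. It maps count indices (0, 1, ...) onto nodes
# map keys ("patroni-01", ...) and prints Terraform "moved" blocks.

def moved_blocks(module, resources, node_keys):
    """Yield one moved block per (resource, index) pair in the module."""
    for resource in resources:
        for index, key in enumerate(node_keys):
            old = f'module.{module}.{resource}[{index}]'
            new = f'module.{module}.{resource}["{key}"]'
            yield f'moved {{\n  from = {old}\n  to   = {new}\n}}\n'

if __name__ == "__main__":
    blocks = moved_blocks(
        module="patroni",
        resources=["google_compute_instance.instance"],
        node_keys=["patroni-01", "patroni-02"],
    )
    print("\n".join(blocks))
```

The output would be written to a moved_*.tf file alongside the environment's configuration, then removed once the state moves have been applied.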

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Push the old state back up: tf state push -force $(date +%Y%m%d).tfstate
  • If the MR that has the version bumps to 6.0.0 has already been merged, revert it.
  • Check that there are no changes - in config-mgmt/environments/gstg, do a tf plan
  • Set label: /label ~change::aborted

Monitoring

Key metrics to observe

  • Metric: Metric Name
    • Location: Dashboard URL
    • What changes to this metric should prompt a rollback: Describe Changes

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The labels blocks deployments and/or blocks feature-flags are applied as necessary

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
    • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are severity::1 or severity::2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Edited by Gonzalo Servat