Expand and use Cloud NAT in gprd
Production Change - Criticality 2 C2
| Change Objective | Expand the existing Cloud NAT to cover the whole gprd region, then remove public IPs from gprd instances. |
|---|---|
| Change Type | Modification of Cloud NAT, then removal of most instance public IPs. |
| Services Impacted | All |
| Change Team Members | @craigf, @hphilipps, and perhaps some kind volunteers in a convenient timezone (to update) |
| Change Severity | C2 |
| Buddy check | A colleague will review the change (@hphilipps) |
| Tested in staging | The change was tested on staging environment. See https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1036 followed by https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1035. |
| Schedule of the change | 2019-09-30 00:45 UTC |
| Duration of the change | 1 hour |
## Steps

### Part 1: Expand Cloud NAT (COMPLETE)

Part 1 has been executed. Move on to part 2.
A Cloud NAT instance already exists in gprd, serving GKE. It only NATs traffic from the GKE subnet. A Cloud NAT instance can serve a whole region within a network, but two NAT instances cannot coexist if there is any overlap in what they serve. Therefore we must expand the coverage of the existing Cloud NAT.
These steps cut over between NAT IPs gracefully, and this part should not cause downtime to any service, so it should be safe to execute at any time of day.
- Before this change, only GKE nodes are using the current Cloud NAT. Run a test to ensure their public internet connectivity is maintained:
- Get set up with GKE in gprd if you haven't already: https://gitlab.com/gitlab-com/runbooks/blob/master/howto/k8s-gitlab-operations.md#console-server-setup-for-the-oncall
  ```
  kubectl run yourname-test -it --image=alpine --restart=Never --rm -- sh -c 'apk add --no-cache curl && while true; do curl https://api.ipify.org; echo; sleep 1; done'
  ```

  - Keep an eye on this NAT IP while performing the next steps.
- `cd <gitlab-com-infrastructure>/environments/gprd && git fetch && git checkout craigf/cloud-nat-gprd && tf init`. This branch is MRed in https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1045.
- Create the only non-conflicting resources:

  ```
  tf apply -target module.nat.google_compute_address.nat_ips
  ```

- Outside of terraform, edit the Cloud NAT resource: add the new IPs, and save. Put the old IP into draining mode, and save. Remove the old IP, and save again. Using draining mode and waiting a few seconds should be enough to mitigate the error burst that can occur when removing a NAT IP, as inbound packets fail to route back.
- Check on the NAT IP check pod that we've been running in step 1. The IP should have changed, and NAT should still be working.
- Check that the registry application is making outbound requests to GCS successfully (the registry is running in GKE now): Navigate to a container registry page. Example: https://gitlab.com/gitlab-com/gl-infra/ci-images/container_registry.
- Edit the terraform state to bring the NAT and router resources under management of the new module instance:

  ```
  tf state rm google_compute_router.nat-router
  tf state rm google_compute_router_nat.gke-nat
  tf import module.nat.google_compute_router.router gitlab-gke
  tf import module.nat.google_compute_router_nat.nat gitlab-gke/gitlab-gke
  ```

- `tf plan`. It should want to modify the NAT's source ranges to cover the whole region, and to destroy the now-unused static IP that was previously declared. It should be safe to apply this plan. No "forces replacement" diff should be shown for the NAT or the router.
- Keep an eye on the NAT IP check pod that we've been running since step 1. If all is still good, you can interrupt it with ^C.
- Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1045.
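The state surgery above is the most delicate step, since a failed import leaves the state half-migrated. A minimal defensive sketch, assuming `tf` is your terraform alias as used throughout this plan; the `state_surgery` function name and backup filename are illustrative, not part of the plan:

```shell
# Hypothetical wrapper for the state moves above. "tf" is the terraform
# alias used throughout this plan; nothing runs until you call the function.
state_surgery() {
  # Snapshot the remote state first, so there is a known-good copy to
  # hand-restore from if an import goes wrong mid-sequence.
  tf state pull > "state-backup-$(date +%s).json" || return 1
  tf state rm google_compute_router.nat-router || return 1
  tf state rm google_compute_router_nat.gke-nat || return 1
  tf import module.nat.google_compute_router.router gitlab-gke || return 1
  tf import module.nat.google_compute_router_nat.nat gitlab-gke/gitlab-gke || return 1
}
```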
Rollback:

```
git checkout master
tf state rm module.nat.google_compute_router.router
tf state rm module.nat.google_compute_router_nat.nat
tf import google_compute_router.nat-router gitlab-gke
tf import google_compute_router_nat.gke-nat gitlab-gke/gitlab-gke
tf apply -target google_compute_address.gke-cloud-nat-ip
```

- Manually edit the Cloud NAT, adding the "new" IP (from the restored terraform declaration in the last step) and save. Put the soon-to-be-deleted IPs from the cloud NAT module into draining mode and save.
- The NAT IP check pod should change apparent IP.
- `tf plan`. It should want to de-scope the Cloud NAT's source ranges to just the GKE subnet, and remove the module IPs that are in drain. It should be safe to apply this plan.
- The NAT IP check pod should keep working, showing the same IP.
### Part 2: Remove instance public IPs
Cloud NAT automatically NATs traffic from instances inside its covered subnets that do not have public IPs, and cannot be made to NAT traffic from instances that do have public IPs. We therefore must remove public IPs from gprd instances.
Unfortunately, this is not a graceful operation. In-flight outbound requests can fail as inbound response packets fail to route back after the IP cutover. gstg testing in https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1046 revealed that for some traffic (e.g. HTTP) we must expect to wait for the application-level client timeout, which can be 10s of seconds. Actual outbound traffic outage is ~ a few seconds.
For this reason we want to perform part 2 during our least busy times.
- Prove the Cloud NAT provisioned in part 1 is working: edit the console VM, removing its public IP. You should still be able to access the public internet from that box.
- `cd <gitlab-com-infrastructure>/environments/gprd && git fetch && git checkout craigf/remove-public-ips-gprd && tf init`. This branch is MRed in https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1046.
- Inform the SRE on-call about what you're doing, as the apparent availability loss during draining might set off an alert.
- While running this, keep an eye on the dashboard. There is no alerting here, as occasional low error rates appear to be a normal feature of CI that doesn't cause harm. Even so, watch it, and if the error rate climbs as you remove public IPs (or afterwards), take the following steps:
  - Increase NAT ports per VM. The default of 64 is currently used in gprd and gstg. Successively double it. In CI, where machines make many concurrent outbound connections, we've settled on 256.
- Wait a bit and see the error rate decrease.
- Don't be alarmed at every individual error: it represents a dropped packet, and higher-level protocols such as TCP should retry.
- In the highly unlikely event you end up dialling up nat ports per VM past 256, be aware of the formula for IP count. 5 IPs @ 256 ports should support 1250 VMs, far more than we have in gprd today, so it shouldn't be necessary to increase the number of IPs, but you can if you need to. A rollback might be preferable to this.
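The arithmetic behind that formula can be sanity-checked in shell. This sketch assumes each NAT IP exposes 64512 usable source ports (65536 minus the reserved first 1024, per Cloud NAT's port model); the 1250 figure quoted above is the same calculation rounded down to a conservative round number.

```shell
# Rough Cloud NAT capacity: VMs supported = IPs * (usable ports per IP
# / min ports per VM). 64512 usable ports per IP assumes Cloud NAT's
# pool of ports 1024-65535.
ports_per_ip=64512
ips=5
for ports_per_vm in 64 128 256; do
  echo "${ports_per_vm} ports/VM x ${ips} IPs -> $(( ips * ports_per_ip / ports_per_vm )) VMs"
done
```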
- Drain, then remove public IPs from, canaries:
  - `/chatops run canary --drain --production`
  - Wait for canary traffic to drop.
  - For each canary instance, remove the public IP:

    ```
    gcloud --project=gitlab-production compute instances delete-access-config --access-config-name "external-nat" $instance-name
    ```

  - `/chatops run canary --ready --production`
- Drain, then remove public IPs from, risky sections of the sv tier in a rolling fashion. For each of the web and git sections of the fleet:
  - Set 5 VMs at a time (this will run in a quiet period) to drain, e.g.

    ```
    ./bin/set-server-state gprd drain "web-0[1-5]"
    ```

  - For each drained instance, remove the public IP:

    ```
    gcloud --project=gitlab-production compute instances delete-access-config --access-config-name "external-nat" $instance-name
    ```

  - Return the 5 nodes to service, e.g.

    ```
    ./bin/set-server-state gprd ready "web-0[1-5]"
    ```
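The rolling batch above can be turned into a dry run that only prints the commands for review before anything is executed. A sketch; the `print_batch_commands` helper and the instance names are illustrative, not part of the plan:

```shell
# Print (not run) the drain / strip-IP / ready sequence for one batch.
# Pass the set-server-state pattern first, then the instance names.
print_batch_commands() {
  pattern=$1; shift
  echo "./bin/set-server-state gprd drain \"$pattern\""
  for instance in "$@"; do
    echo "gcloud --project=gitlab-production compute instances delete-access-config --access-config-name external-nat $instance"
  done
  echo "./bin/set-server-state gprd ready \"$pattern\""
}

print_batch_commands "web-0[1-5]" web-01 web-02 web-03 web-04 web-05
```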
- Sidekiq. For each sidekiq node (you can probably get away with 2 at a time during the quiet period):
  - Tell sidekiq to finish all current work and stop accepting new jobs (https://github.com/mperham/sidekiq/wiki/Deployment):

    ```
    knife ssh 'TARGETNODE' $'for pid in $(ps -ef|awk \'/sidekiq.*queues/ {print $2}\'|sort -u); do echo "Sending TSTP signal to ${pid}..."; sudo kill -TSTP $pid; done'
    ```

    (lifted from #997 (closed)).
  - Wait for jobs to finish.
  - For each drained instance, remove the public IP:

    ```
    gcloud --project=gitlab-production compute instances delete-access-config --access-config-name "external-nat" $instance-name
    ```

  - Restart sidekiq on the drained nodes.
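The PID-extraction pipeline inside that knife one-liner can be verified locally against canned `ps -ef` output before pointing it at a real node. The sample lines below are fabricated for illustration:

```shell
# Run the awk/sort pipeline from the knife one-liner against fake
# "ps -ef" output: it should select only the sidekiq worker PIDs.
sample='root  1234     1  0 00:00 ?  00:00:01 sidekiq 5.2 gitlab [0 of 25 busy] queues
git   5678     1  0 00:00 ?  00:00:01 sidekiq 5.2 gitlab [3 of 25 busy] queues
root  9999     1  0 00:00 ?  00:00:00 sshd: worker'
echo "$sample" | awk '/sidekiq.*queues/ {print $2}' | sort -u
```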
- Remove public IPs from the rest of the fleet:
  - `tf plan -parallelism 100 -out plan`
  - If the plan shows only relevant changes: `tf apply ./plan`. See below for what might constitute relevant changes.
  - If the plan doesn't show only relevant changes:
    - Get a list of all modules:

      ```
      tf state list | grep '^module' | sed -E 's/^(module\.[^\.]+).*$/\1/g' | sort -u
      ```

    - For each module that looks like it might contain instances, `terraform apply -target module.module_name`. Review the plan, and agree if happy.
    - A few iterations of the above steps should make the total plan more manageable.
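The module-extraction pipeline above can be tried against canned `tf state list` output to see what it reduces to. The resource names in the sample are illustrative:

```shell
# The grep/sed/sort pipeline from the step above, fed fake
# "tf state list" output: it should reduce to unique module prefixes.
sample='module.nat.google_compute_router.router
module.web.google_compute_instance.instance[0]
module.web.google_compute_instance.instance[1]
google_dns_record_set.pages'
echo "$sample" | grep '^module' | sed -E 's/^(module\.[^\.]+).*$/\1/g' | sort -u
```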
- Validate that no gprd machines have public IPs by running (GNU grep):

  ```
  gcloud --project=gitlab-production compute instances list --format 'table(NAME, EXTERNAL_IP)' | grep -P '\d+\.\d+\.\d+\.\d+$'
  ```

- Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1046.
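To see what a failure of the validation step looks like, the grep can be exercised on fabricated listing output: any line still ending in an IPv4 address is an instance that kept its public IP, so success is an empty result (grep exits non-zero). The sample rows are invented for illustration:

```shell
# Fabricated "gcloud compute instances list" output: only the row that
# still ends in an IPv4 address should survive the grep (GNU grep -P).
sample='NAME     EXTERNAL_IP
web-01
web-02   35.196.231.142
redis-01'
echo "$sample" | grep -P '\d+\.\d+\.\d+\.\d+$'
```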
Example of a "good" instance plan diff:
# module.redis-sidekiq.google_compute_instance.instance_with_attached_disk[2] will be updated in-place
~ resource "google_compute_instance" "instance_with_attached_disk" {
allow_stopping_for_update = true
can_ip_forward = false
cpu_platform = "Intel Haswell"
deletion_protection = false
guest_accelerator = []
id = "redis-sidekiq-03-db-gstg"
instance_id = "4732032743729991807"
label_fingerprint = "VDcZzyRQv-c="
labels = {
"environment" = "gstg"
"pet_name" = "redis-sidekiq"
}
machine_type = "n1-standard-2"
metadata = {
"CHEF_BOOTSTRAP_BUCKET" = "gitlab-gstg-chef-bootstrap"
"CHEF_BOOTSTRAP_KEY" = "gitlab-gstg-bootstrap-validation"
"CHEF_BOOTSTRAP_KEYRING" = "gitlab-gstg-bootstrap"
"CHEF_DNS_ZONE_NAME" = "gitlab.com"
"CHEF_ENVIRONMENT" = "gstg"
"CHEF_INIT_RUN_LIST" = ""
"CHEF_NODE_NAME" = "redis-sidekiq-03-db-gstg.c.gitlab-staging-1.internal"
"CHEF_PROJECT" = "gitlab-staging-1"
"CHEF_RUN_LIST" = "\"role[gstg-base-db-redis-server-sidekiq]\""
"CHEF_URL" = "https://chef.gitlab.com/organizations/gitlab/"
"CHEF_VERSION" = "14.13.11"
"GL_BOOTSTRAP_DATA_DISK" = "true"
"GL_FORMAT_DATA_DISK" = "false"
"GL_KERNEL_VERSION" = ""
"GL_PERSISTENT_DISK_PATH" = "/var/opt/gitlab"
"block-project-ssh-keys" = "TRUE"
"enable-oslogin" = "FALSE"
"shutdown-script" = "#!/bin/bash\nCHEF_NODE_NAME=\"$(curl -s \"http://metadata.google.internal/computeMetadata/v1/instance/attributes/CHEF_NODE_NAME\" -H \"Metadata-Flavor: Google\")\"\nif [[ ! -f /etc/chef/client.pem ]]; then\n # No client.pem, nothing to do\n exit 0\nfi\n\nif type -P knife >/dev/null; then\n knife node delete \"$CHEF_NODE_NAME\" -c /etc/chef/client.rb -y\n knife client delete \"$CHEF_NODE_NAME\" -c /etc/chef/client.rb -y\nfi\n\nrm -f /etc/chef/client.pem\n"
}
metadata_fingerprint = "avEikb_DNS4="
metadata_startup_script = "#!/bin/bash\n# vim: ai:ts=8:sw=8:noet\n# This script is passed as a startup-script to GCP instances\n###################################################\n### NOTE: It is being run on _every_ boot ###\n### It MUST be non destructive and itempotent ###\n###################################################\n\nexec &> >(tee -a \"/var/tmp/bootstrap-$(date +%Y%m%d-%H%M%S).log\")\nset -x\n\nSECONDS=0\necho \"$(date -u): Bootstrap start\"\n\nenv\n\n\n# Pass env variables\nfor i in $(curl -s \"http://metadata.google.internal/computeMetadata/v1/instance/attributes/\" -H \"Metadata-Flavor: Google\"); do\n if [[ $i == CHEF* ]]; then\n export \"$i\"=\"$(curl -s \"http://metadata.google.internal/computeMetadata/v1/instance/attributes/$i\" -H \"Metadata-Flavor: Google\")\"\n fi\n if [[ $i == GL* ]]; then\n export \"$i\"=\"$(curl -s \"http://metadata.google.internal/computeMetadata/v1/instance/attributes/$i\" -H \"Metadata-Flavor: Google\")\"\n fi\ndone\n\n# Lookup consul's service endpoint\napt-get install jq -y -q\n\nformat_ext4() {\n mkfs.ext4 -m 0 -F -E lazy_itable_init=0,lazy_journal_init=0,discard $1\n}\n\nmount_device() {\n local device_path=$1\n local mount_path=$2\n\n mkdir -p \"$mount_path\"\n if ! grep -qs \"$mount_path\" /proc/mounts; then\n mount -o discard,defaults $device_path \"$mount_path\"\n fi\n local UUID=\"$(sudo blkid -s UUID -o value $device_path)\"\n if ! 
grep -qs \"$UUID\" /etc/fstab; then\n echo UUID=\"$UUID\" \"$mount_path\" ext4 discard,defaults 0 2 | tee -a /etc/fstab\n fi\n}\n\nif [[ -L /dev/disk/by-id/google-log ]]; then\n if [[ $(file -sL /dev/disk/by-id/google-log) != *Linux* ]]; then\n format_ext4 /dev/disk/by-id/google-log\n fi\n\n # In case we resized the underlying GCP disk\n resize2fs /dev/disk/by-id/google-log\n\n mount_device /dev/disk/by-id/google-log /var/log\nfi\n\n# default to false, force a reformat even if there is an existing\n# Linux filesystem\nGL_FORMAT_DATA_DISK=${GL_FORMAT_DATA_DISK:-false}\n\nif [[ -b /dev/sdb && (\"true\" == \"${GL_FORMAT_DATA_DISK}\" || $(file -sL /dev/sdb) != *Linux*) ]]; then\n format_ext4 /dev/sdb\nfi\n\n# Proceed with mounting\nif [[ -L /dev/disk/by-id/google-persistent-disk-1 ]]; then\n mount_device /dev/sdb \"${GL_PERSISTENT_DISK_PATH:-/var/opt/gitlab}\"\nfi\n\n# Install chef\n\ncurl -L https://omnitruck.chef.io/install.sh | sudo bash -s -- -v \"${CHEF_VERSION}\"\n\nmkdir -p /etc/chef\n\nif [[ ! -e /etc/chef/client.rb ]]; then\n # create client.rb\n cat > /etc/chef/client.rb <<-EOF\nchef_server_url \"$CHEF_URL\"\nvalidation_client_name \"gitlab-validator\"\nlog_location STDOUT\nnode_name \"$CHEF_NODE_NAME\"\nenvironment \"$CHEF_ENVIRONMENT\"\nEOF\nfi\n\nif [[ ! -e /etc/chef/client.pem ]]; then\n # Get validation.pem from gkms and register node\n gsutil cp gs://$CHEF_BOOTSTRAP_BUCKET/validation.enc /tmp/validation.enc\n\n gcloud kms decrypt --keyring=$CHEF_BOOTSTRAP_KEYRING --location=global --key=$CHEF_BOOTSTRAP_KEY --plaintext-file=/etc/chef/validation.pem --ciphertext-file=/tmp/validation.enc\n\n # register client\n chef-client\n rm -f /tmp/validation.enc /etc/chef/validation.pem\nfi\n\n# persist the run list\nknife node -c /etc/chef/client.rb run_list set $CHEF_NODE_NAME $(echo $CHEF_RUN_LIST | tr -d '\"')\n\n# run chef using the new or modified runlist\nchef-client\n\n# On first boot run the additional runlist if it is defined.\nif [[ ! 
-e /var/tmp/inital-boot-run.lock && -n $CHEF_INIT_RUN_LIST ]]; then\n CHEF_CLIENT_ARGS=\"-o $(echo \"$CHEF_INIT_RUN_LIST\" | sed 's/\"\\|,$//g')\"\n chef-client $CHEF_CLIENT_ARGS\nfi\n\n# Upgrade the kernel, but only if we can find a package with the specified version\n# in the updated Package list\napt-get update\nif [ `apt-cache search linux-image-${GL_KERNEL_VERSION}-gcp |wc -l` == 1 ]; then\n if [[ -n $GL_KERNEL_VERSION && $(uname -r) != *${GL_KERNEL_VERSION}* ]]; then\n apt-get install -y linux-modules-${GL_KERNEL_VERSION}-gcp linux-modules-extra-${GL_KERNEL_VERSION}-gcp linux-image-${GL_KERNEL_VERSION}-gcp linux-gcp-headers-$GL_KERNEL_VERSION\n apt-get purge -y $(dpkg-query -W -f='${binary:Package}\\n' 'linux-image*' 'linux-headers*' | grep -v $GL_KERNEL_VERSION)\n update-grub\n touch /tmp/bootstrap-reboot\n fi\nfi\n\nduration=$SECONDS\necho \"$(date -u): Bootstrap finished in $(($duration / 60)) minutes and $(($duration % 60)) seconds\"\n\ntouch /var/tmp/inital-boot-run.lock\n\nif [[ -f /tmp/bootstrap-reboot ]]; then\n rm -f /tmp/bootstrap-reboot\n reboot\nfi\n"
name = "redis-sidekiq-03-db-gstg"
project = "gitlab-staging-1"
self_link = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-b/instances/redis-sidekiq-03-db-gstg"
tags = [
"gstg",
"redis-sidekiq",
]
tags_fingerprint = "Ck84bZ43JEM="
zone = "us-east1-b"
attached_disk {
device_name = "persistent-disk-1"
mode = "READ_WRITE"
source = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-b/disks/redis-sidekiq-03-db-gstg-data"
}
attached_disk {
device_name = "log"
mode = "READ_WRITE"
source = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-b/disks/redis-sidekiq-03-db-gstg-log"
}
boot_disk {
auto_delete = true
device_name = "persistent-disk-0"
source = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-b/disks/redis-sidekiq-03-db-gstg"
initialize_params {
image = "https://www.googleapis.com/compute/v1/projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20180122"
labels = {}
size = 20
type = "pd-standard"
}
}
~ network_interface {
name = "nic0"
network = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/global/networks/gstg"
network_ip = "10.224.22.103"
subnetwork = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/regions/us-east1/subnetworks/redis-sidekiq-gstg"
subnetwork_project = "gitlab-staging-1"
- access_config {
- nat_ip = "35.196.231.142" -> null
- network_tier = "PREMIUM" -> null
}
}
scheduling {
automatic_restart = true
on_host_maintenance = "MIGRATE"
preemptible = false
}
service_account {
email = "terraform@gitlab-staging-1.iam.gserviceaccount.com"
scopes = [
"https://www.googleapis.com/auth/cloud.useraccounts.readonly",
"https://www.googleapis.com/auth/cloudkms",
"https://www.googleapis.com/auth/compute.readonly",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring.write",
"https://www.googleapis.com/auth/pubsub",
"https://www.googleapis.com/auth/service.management.readonly",
"https://www.googleapis.com/auth/servicecontrol",
"https://www.googleapis.com/auth/trace.append",
]
}
timeouts {}
}
This will look better in color in a terminal. Note that the instance doesn't have to be destroyed (nothing "forces replacement") and the only diff is the removal of the access config on the network interface.
Apart from those instance diffs, static public IPs will also be deleted, and an unused DNS record for pages (a relic of the Azure migration apparently) will also be removed.
Rollback:

- `git checkout master`, then `terraform plan -out plan`. If the plan is clean, apply it. Otherwise, go module-by-module in a similar way to above.