Expand and use Cloud NAT in gprd
Production Change - Criticality 2 C2
| Change Objective | Expand the existing Cloud NAT to cover the whole gprd region, then remove public IPs from gprd instances. |
|---|---|
| Change Type | Modification of Cloud NAT, then removal of most instance public IPs. |
| Services Impacted | All |
| Change Team Members | @craigf, @hphilipps, and perhaps some kind volunteers in a convenient timezone (to update) |
| Change Severity | C2 |
| Buddy check | A colleague will review the change (@hphilipps) |
| Tested in staging | The change was tested on staging environment. See https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1036 followed by https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1035. |
| Schedule of the change | 2019-09-30 00:45 UTC |
| Duration of the change | 1 hour |
## Steps

### Part 1: Expand Cloud NAT (COMPLETE)

Part 1 has been executed. Move on to part 2.
A Cloud NAT instance already exists in gprd, serving GKE. It only NATs traffic from the GKE subnet. A Cloud NAT instance can serve a whole region within a network, but two NAT instances cannot coexist if there is any overlap in what they serve. Therefore we must expand the coverage of the existing Cloud NAT.
These steps cut over between NAT IPs gracefully, and this part should not cause downtime to any service, so it should be safe to execute at any time of day.
- Before this change, only GKE nodes are using the current Cloud NAT. Run a test to ensure their public internet connectivity is maintained:
- Get set up with GKE in gprd if you haven't already: https://gitlab.com/gitlab-com/runbooks/blob/master/howto/k8s-gitlab-operations.md#console-server-setup-for-the-oncall
  ```
  kubectl run yourname-test -it --image=alpine --restart=Never --rm -- sh -c 'apk add --no-cache curl && while true; do curl https://api.ipify.org; echo; sleep 1; done'
  ```

  - Keep an eye on this NAT IP while performing the next steps.
- `cd <gitlab-com-infrastructure>/environments/gprd && git fetch && git checkout craigf/cloud-nat-gprd && tf init`. This branch is MRed in https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1045.
- Create the only non-conflicting resources:

  ```
  tf apply -target module.nat.google_compute_address.nat_ips
  ```

- Outside of terraform, edit the Cloud NAT resource: add the new IPs, and save. Put the old IP into draining mode, and save. Remove the old IP, and save again. Using draining mode and waiting a few seconds should be enough to mitigate the error burst that can occur when removing a NAT IP, as inbound packets fail to route back.
- Check on the NAT IP check pod that we've been running in step 1. The IP should have changed, and NAT should still be working.
- Check that the registry application is making outbound requests to GCS successfully (the registry is running in GKE now): Navigate to a container registry page. Example: https://gitlab.com/gitlab-com/gl-infra/ci-images/container_registry.
- Edit the terraform state to bring the NAT and router resources under management of the new module instance:

  ```
  tf state rm google_compute_router.nat-router
  tf state rm google_compute_router_nat.gke-nat
  tf import module.nat.google_compute_router.router gitlab-gke
  tf import module.nat.google_compute_router_nat.nat gitlab-gke/gitlab-gke
  ```

- `tf plan`. It should want to modify the NAT's source ranges to cover the whole region, and to destroy the now-unused static IP that was previously declared. It should be safe to apply this plan. No "forces replacement" diff should be shown for the NAT or the router.
- Keep an eye on the NAT IP check pod that we've been running since step 1. If all is still good, you can interrupt it with ^C.
- Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1045.
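The state surgery above is the most delicate step, since a failed import leaves the state half-migrated. A minimal defensive sketch, assuming `tf` is your terraform alias as used throughout this plan; the `state_surgery` function name and backup filename are illustrative, not part of the plan:

```shell
# Hypothetical wrapper for the state moves above. "tf" is the terraform
# alias used throughout this plan; nothing runs until you call the function.
state_surgery() {
  # Snapshot the remote state first, so there is a known-good copy to
  # hand-restore from if an import goes wrong mid-sequence.
  tf state pull > "state-backup-$(date +%s).json" || return 1
  tf state rm google_compute_router.nat-router || return 1
  tf state rm google_compute_router_nat.gke-nat || return 1
  tf import module.nat.google_compute_router.router gitlab-gke || return 1
  tf import module.nat.google_compute_router_nat.nat gitlab-gke/gitlab-gke || return 1
}
```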
Rollback:

```
git checkout master
tf state rm module.nat.google_compute_router.router
tf state rm module.nat.google_compute_router_nat.nat
tf import google_compute_router.nat-router gitlab-gke
tf import google_compute_router_nat.gke-nat gitlab-gke/gitlab-gke
tf apply -target google_compute_address.gke-cloud-nat-ip
```

- Manually edit the Cloud NAT, adding the "new" IP (from the restored terraform declaration in the last step) and save. Put the soon-to-be-deleted IPs from the cloud NAT module into draining mode and save.
- The NAT IP check pod should change apparent IP.
- `tf plan`. It should want to de-scope the Cloud NAT's source ranges to just the GKE subnet, and remove the module IPs that are in drain. It should be safe to apply this plan.
- The NAT IP check pod should keep working, showing the same IP.
### Part 2: Remove instance public IPs
Cloud NAT automatically NATs traffic from instances inside its covered subnets that do not have public IPs, and cannot be made to NAT traffic from instances that do have public IPs. We therefore must remove public IPs from gprd instances.
Unfortunately, this is not a graceful operation. In-flight outbound requests can fail as inbound response packets fail to route back after the IP cutover. gstg testing in https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1046 revealed that for some traffic (e.g. HTTP) we must expect to wait for the application-level client timeout, which can be 10s of seconds. Actual outbound traffic outage is ~ a few seconds.
For this reason we want to perform part 2 during our least busy times.
- Prove the Cloud NAT provisioned in part 1 is working: edit the console VM, removing its public IP. You should still be able to access the public internet from that box.
- `cd <gitlab-com-infrastructure>/environments/gprd && git fetch && git checkout craigf/remove-public-ips-gprd && tf init`. This branch is MRed in https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1046.
- Inform the SRE on-call about what you're doing, as the apparent availability loss during draining might set off an alert.
- While running this, keep an eye on the dashboard. There is no alerting here, as occasional low error rates appear to be a normal feature of CI that doesn't cause harm. Even so, watch it, and if the error rate climbs as you remove public IPs (or afterwards), take the following steps:
  - Increase NAT ports per VM. The default of 64 is currently used in gprd and gstg. Successively double it. In CI, where machines make many concurrent outbound connections, we've settled on 256.
- Wait a bit and see the error rate decrease.
- Don't be alarmed at every individual error: it represents a dropped packet, and higher-level protocols such as TCP should retry.
- In the highly unlikely event you end up dialling up nat ports per VM past 256, be aware of the formula for IP count. 5 IPs @ 256 ports should support 1250 VMs, far more than we have in gprd today, so it shouldn't be necessary to increase the number of IPs, but you can if you need to. A rollback might be preferable to this.
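The arithmetic behind that formula can be sanity-checked in shell. This sketch assumes each NAT IP exposes 64512 usable source ports (65536 minus the reserved first 1024, per Cloud NAT's port model); the 1250 figure quoted above is the same calculation rounded down to a conservative round number.

```shell
# Rough Cloud NAT capacity: VMs supported = IPs * (usable ports per IP
# / min ports per VM). 64512 usable ports per IP assumes Cloud NAT's
# pool of ports 1024-65535.
ports_per_ip=64512
ips=5
for ports_per_vm in 64 128 256; do
  echo "${ports_per_vm} ports/VM x ${ips} IPs -> $(( ips * ports_per_ip / ports_per_vm )) VMs"
done
```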
- Drain, then remove public IPs from, canaries:
  - `/chatops run canary --drain --production`
  - Wait for canary traffic to drop.
  - For each canary instance, remove the public IP:

    ```
    gcloud --project=gitlab-production compute instances delete-access-config --access-config-name "external-nat" $instance-name
    ```

  - `/chatops run canary --ready --production`
- Drain, then remove public IPs from, risky sections of the sv tier in a rolling fashion. For each of the web and git sections of the fleet:
  - Set 5 VMs at a time (this will run in a quiet period) to drain, e.g.

    ```
    ./bin/set-server-state gprd drain "web-0[1-5]"
    ```

  - For each drained instance, remove the public IP:

    ```
    gcloud --project=gitlab-production compute instances delete-access-config --access-config-name "external-nat" $instance-name
    ```

  - Return the 5 nodes to service, e.g.

    ```
    ./bin/set-server-state gprd ready "web-0[1-5]"
    ```
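The rolling batch above can be turned into a dry run that only prints the commands for review before anything is executed. A sketch; the `print_batch_commands` helper and the instance names are illustrative, not part of the plan:

```shell
# Print (not run) the drain / strip-IP / ready sequence for one batch.
# Pass the set-server-state pattern first, then the instance names.
print_batch_commands() {
  pattern=$1; shift
  echo "./bin/set-server-state gprd drain \"$pattern\""
  for instance in "$@"; do
    echo "gcloud --project=gitlab-production compute instances delete-access-config --access-config-name external-nat $instance"
  done
  echo "./bin/set-server-state gprd ready \"$pattern\""
}

print_batch_commands "web-0[1-5]" web-01 web-02 web-03 web-04 web-05
```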
- Sidekiq. For each sidekiq node (you can probably get away with 2 at a time during the quiet period):
  - Tell sidekiq to finish all current work and stop accepting new jobs (https://github.com/mperham/sidekiq/wiki/Deployment):

    ```
    knife ssh 'TARGETNODE' $'for pid in $(ps -ef|awk \'/sidekiq.*queues/ {print $2}\'|sort -u); do echo "Sending TSTP signal to ${pid}..."; sudo kill -TSTP $pid; done'
    ```

    (lifted from #997 (closed)).
  - Wait for jobs to finish.
  - For each drained instance, remove the public IP:

    ```
    gcloud --project=gitlab-production compute instances delete-access-config --access-config-name "external-nat" $instance-name
    ```

  - Restart sidekiq on the drained nodes.
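The PID-extraction pipeline inside that knife one-liner can be verified locally against canned `ps -ef` output before pointing it at a real node. The sample lines below are fabricated for illustration:

```shell
# Run the awk/sort pipeline from the knife one-liner against fake
# "ps -ef" output: it should select only the sidekiq worker PIDs.
sample='root  1234     1  0 00:00 ?  00:00:01 sidekiq 5.2 gitlab [0 of 25 busy] queues
git   5678     1  0 00:00 ?  00:00:01 sidekiq 5.2 gitlab [3 of 25 busy] queues
root  9999     1  0 00:00 ?  00:00:00 sshd: worker'
echo "$sample" | awk '/sidekiq.*queues/ {print $2}' | sort -u
```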
- Remove public IPs from the rest of the fleet:
  - `tf plan -parallelism 100 -out plan`
  - If the plan shows only relevant changes: `tf apply ./plan`. See below for what might constitute relevant changes.
  - If the plan doesn't show only relevant changes:
    - Get a list of all modules:

      ```
      tf state list | grep '^module' | sed -E 's/^(module\.[^\.]+).*$/\1/g' | sort -u
      ```

    - For each module that looks like it might contain instances, `terraform apply -target module.module_name`. Review the plan, and agree if happy.
    - A few iterations of the above steps should make the total plan more manageable.
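The module-extraction pipeline above can be tried against canned `tf state list` output to see what it reduces to. The resource names in the sample are illustrative:

```shell
# The grep/sed/sort pipeline from the step above, fed fake
# "tf state list" output: it should reduce to unique module prefixes.
sample='module.nat.google_compute_router.router
module.web.google_compute_instance.instance[0]
module.web.google_compute_instance.instance[1]
google_dns_record_set.pages'
echo "$sample" | grep '^module' | sed -E 's/^(module\.[^\.]+).*$/\1/g' | sort -u
```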
- Validate that no gprd machines have public IPs by running (GNU grep):

  ```
  gcloud --project=gitlab-production compute instances list --format 'table(NAME, EXTERNAL_IP)' | grep -P '\d+\.\d+\.\d+\.\d+$'
  ```

- Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/1046.
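To see what a failure of the validation step looks like, the grep can be exercised on fabricated listing output: any line still ending in an IPv4 address is an instance that kept its public IP, so success is an empty result (grep exits non-zero). The sample rows are invented for illustration:

```shell
# Fabricated "gcloud compute instances list" output: only the row that
# still ends in an IPv4 address should survive the grep (GNU grep -P).
sample='NAME     EXTERNAL_IP
web-01
web-02   35.196.231.142
redis-01'
echo "$sample" | grep -P '\d+\.\d+\.\d+\.\d+$'
```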
Example of a "good" instance plan diff:
# module.redis-sidekiq.google_compute_instance.instance_with_attached_disk[2] will be updated in-place
~ resource "google_compute_instance" "instance_with_attached_disk" {
allow_stopping_for_update = true
can_ip_forward = false
cpu_platform = "Intel Haswell"
deletion_protection = false
guest_accelerator = []
id = "redis-sidekiq-03-db-gstg"
instance_id = "4732032743729991807"
label_fingerprint = "VDcZzyRQv-c="
labels = {
"environment" = "gstg"
"pet_name" = "redis-sidekiq"
}
machine_type = "n1-standard-2"
metadata = {
"CHEF_BOOTSTRAP_BUCKET" = "gitlab-gstg-chef-bootstrap"
"CHEF_BOOTSTRAP_KEY" = "gitlab-gstg-bootstrap-validation"
"CHEF_BOOTSTRAP_KEYRING" = "gitlab-gstg-bootstrap"
"CHEF_DNS_ZONE_NAME" = "gitlab.com"
"CHEF_ENVIRONMENT" = "gstg"
"CHEF_INIT_RUN_LIST" = ""
"CHEF_NODE_NAME" = "redis-sidekiq-03-db-gstg.c.gitlab-staging-1.internal"
"CHEF_PROJECT" = "gitlab-staging-1"
"CHEF_RUN_LIST" = "\"role[gstg-base-db-redis-server-sidekiq]\""
"CHEF_URL" = "https://chef.gitlab.com/organizations/gitlab/"
"CHEF_VERSION" = "14.13.11"
"GL_BOOTSTRAP_DATA_DISK" = "true"
"GL_FORMAT_DATA_DISK" = "false"
"GL_KERNEL_VERSION" = ""
"GL_PERSISTENT_DISK_PATH" = "/var/opt/gitlab"
"block-project-ssh-keys" = "TRUE"
"enable-oslogin" = "FALSE"
"shutdown-script" = "#!/bin/bash\nCHEF_NODE_NAME=\"$(curl -s \"http://metadata.google.internal/computeMetadata/v1/instance/attributes/CHEF_NODE_NAME\" -H \"Metadata-Flavor: Google\")\"\nif [[ ! -f /etc/chef/client.pem ]]; then\n # No client.pem, nothing to do\n exit 0\nfi\n\nif type -P knife >/dev/null; then\n knife node delete \"$CHEF_NODE_NAME\" -c /etc/chef/client.rb -y\n knife client delete \"$CHEF_NODE_NAME\" -c /etc/chef/client.rb -y\nfi\n\nrm -f /etc/chef/client.pem\n"
}
metadata_fingerprint = "avEikb_DNS4="
metadata_startup_script = "#!/bin/bash\n# vim: ai:ts=8:sw=8:noet\n# This script is passed as a startup-script to GCP instances\n###################################################\n### NOTE: It is being run on _every_ boot ###\n### It MUST be non destructive and itempotent ###\n###################################################\n\nexec &> >(tee -a \"/var/tmp/bootstrap-$(date +%Y%m%d-%H%M%S).log\")\nset -x\n\nSECONDS=0\necho \"$(date -u): Bootstrap start\"\n\nenv\n\n\n# Pass env variables\nfor i in $(curl -s \"http://metadata.google.internal/computeMetadata/v1/instance/attributes/\" -H \"Metadata-Flavor: Google\"); do\n if [[ $i == CHEF* ]]; then\n export \"$i\"=\"$(curl -s \"http://metadata.google.internal/computeMetadata/v1/instance/attributes/$i\" -H \"Metadata-Flavor: Google\")\"\n fi\n if [[ $i == GL* ]]; then\n export \"$i\"=\"$(curl -s \"http://metadata.google.internal/computeMetadata/v1/instance/attributes/$i\" -H \"Metadata-Flavor: Google\")\"\n fi\ndone\n\n# Lookup consul's service endpoint\napt-get install jq -y -q\n\nformat_ext4() {\n mkfs.ext4 -m 0 -F -E lazy_itable_init=0,lazy_journal_init=0,discard $1\n}\n\nmount_device() {\n local device_path=$1\n local mount_path=$2\n\n mkdir -p \"$mount_path\"\n if ! grep -qs \"$mount_path\" /proc/mounts; then\n mount -o discard,defaults $device_path \"$mount_path\"\n fi\n local UUID=\"$(sudo blkid -s UUID -o value $device_path)\"\n if ! 
grep -qs \"$UUID\" /etc/fstab; then\n echo UUID=\"$UUID\" \"$mount_path\" ext4 discard,defaults 0 2 | tee -a /etc/fstab\n fi\n}\n\nif [[ -L /dev/disk/by-id/google-log ]]; then\n if [[ $(file -sL /dev/disk/by-id/google-log) != *Linux* ]]; then\n format_ext4 /dev/disk/by-id/google-log\n fi\n\n # In case we resized the underlying GCP disk\n resize2fs /dev/disk/by-id/google-log\n\n mount_device /dev/disk/by-id/google-log /var/log\nfi\n\n# default to false, force a reformat even if there is an existing\n# Linux filesystem\nGL_FORMAT_DATA_DISK=${GL_FORMAT_DATA_DISK:-false}\n\nif [[ -b /dev/sdb && (\"true\" == \"${GL_FORMAT_DATA_DISK}\" || $(file -sL /dev/sdb) != *Linux*) ]]; then\n format_ext4 /dev/sdb\nfi\n\n# Proceed with mounting\nif [[ -L /dev/disk/by-id/google-persistent-disk-1 ]]; then\n mount_device /dev/sdb \"${GL_PERSISTENT_DISK_PATH:-/var/opt/gitlab}\"\nfi\n\n# Install chef\n\ncurl -L https://omnitruck.chef.io/install.sh | sudo bash -s -- -v \"${CHEF_VERSION}\"\n\nmkdir -p /etc/chef\n\nif [[ ! -e /etc/chef/client.rb ]]; then\n # create client.rb\n cat > /etc/chef/client.rb <<-EOF\nchef_server_url \"$CHEF_URL\"\nvalidation_client_name \"gitlab-validator\"\nlog_location STDOUT\nnode_name \"$CHEF_NODE_NAME\"\nenvironment \"$CHEF_ENVIRONMENT\"\nEOF\nfi\n\nif [[ ! -e /etc/chef/client.pem ]]; then\n # Get validation.pem from gkms and register node\n gsutil cp gs://$CHEF_BOOTSTRAP_BUCKET/validation.enc /tmp/validation.enc\n\n gcloud kms decrypt --keyring=$CHEF_BOOTSTRAP_KEYRING --location=global --key=$CHEF_BOOTSTRAP_KEY --plaintext-file=/etc/chef/validation.pem --ciphertext-file=/tmp/validation.enc\n\n # register client\n chef-client\n rm -f /tmp/validation.enc /etc/chef/validation.pem\nfi\n\n# persist the run list\nknife node -c /etc/chef/client.rb run_list set $CHEF_NODE_NAME $(echo $CHEF_RUN_LIST | tr -d '\"')\n\n# run chef using the new or modified runlist\nchef-client\n\n# On first boot run the additional runlist if it is defined.\nif [[ ! 
-e /var/tmp/inital-boot-run.lock && -n $CHEF_INIT_RUN_LIST ]]; then\n CHEF_CLIENT_ARGS=\"-o $(echo \"$CHEF_INIT_RUN_LIST\" | sed 's/\"\\|,$//g')\"\n chef-client $CHEF_CLIENT_ARGS\nfi\n\n# Upgrade the kernel, but only if we can find a package with the specified version\n# in the updated Package list\napt-get update\nif [ `apt-cache search linux-image-${GL_KERNEL_VERSION}-gcp |wc -l` == 1 ]; then\n if [[ -n $GL_KERNEL_VERSION && $(uname -r) != *${GL_KERNEL_VERSION}* ]]; then\n apt-get install -y linux-modules-${GL_KERNEL_VERSION}-gcp linux-modules-extra-${GL_KERNEL_VERSION}-gcp linux-image-${GL_KERNEL_VERSION}-gcp linux-gcp-headers-$GL_KERNEL_VERSION\n apt-get purge -y $(dpkg-query -W -f='${binary:Package}\\n' 'linux-image*' 'linux-headers*' | grep -v $GL_KERNEL_VERSION)\n update-grub\n touch /tmp/bootstrap-reboot\n fi\nfi\n\nduration=$SECONDS\necho \"$(date -u): Bootstrap finished in $(($duration / 60)) minutes and $(($duration % 60)) seconds\"\n\ntouch /var/tmp/inital-boot-run.lock\n\nif [[ -f /tmp/bootstrap-reboot ]]; then\n rm -f /tmp/bootstrap-reboot\n reboot\nfi\n"
name = "redis-sidekiq-03-db-gstg"
project = "gitlab-staging-1"
self_link = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-b/instances/redis-sidekiq-03-db-gstg"
tags = [
"gstg",
"redis-sidekiq",
]
tags_fingerprint = "Ck84bZ43JEM="
zone = "us-east1-b"
attached_disk {
device_name = "persistent-disk-1"
mode = "READ_WRITE"
source = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-b/disks/redis-sidekiq-03-db-gstg-data"
}
attached_disk {
device_name = "log"
mode = "READ_WRITE"
source = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-b/disks/redis-sidekiq-03-db-gstg-log"
}
boot_disk {
auto_delete = true
device_name = "persistent-disk-0"
source = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/zones/us-east1-b/disks/redis-sidekiq-03-db-gstg"
initialize_params {
image = "https://www.googleapis.com/compute/v1/projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20180122"
labels = {}
size = 20
type = "pd-standard"
}
}
~ network_interface {
name = "nic0"
network = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/global/networks/gstg"
network_ip = "10.224.22.103"
subnetwork = "https://www.googleapis.com/compute/v1/projects/gitlab-staging-1/regions/us-east1/subnetworks/redis-sidekiq-gstg"
subnetwork_project = "gitlab-staging-1"
- access_config {
- nat_ip = "35.196.231.142" -> null
- network_tier = "PREMIUM" -> null
}
}
scheduling {
automatic_restart = true
on_host_maintenance = "MIGRATE"
preemptible = false
}
service_account {
email = "terraform@gitlab-staging-1.iam.gserviceaccount.com"
scopes = [
"https://www.googleapis.com/auth/cloud.useraccounts.readonly",
"https://www.googleapis.com/auth/cloudkms",
"https://www.googleapis.com/auth/compute.readonly",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring.write",
"https://www.googleapis.com/auth/pubsub",
"https://www.googleapis.com/auth/service.management.readonly",
"https://www.googleapis.com/auth/servicecontrol",
"https://www.googleapis.com/auth/trace.append",
]
}
timeouts {}
}
This will look better in color in a terminal. Note that the instance doesn't have to be destroyed (nothing "forces replacement") and the only diff is the removal of the access config on the network interface.
Apart from those instance diffs, static public IPs will also be deleted, and an unused DNS record for pages (a relic of the Azure migration apparently) will also be removed.
Rollback:

- `git checkout master`, then `terraform plan -out plan`. If the plan is clean, apply it. Otherwise, go module-by-module in a similar way to above.