Migrate Chef from DO to GCP
Production Change
Change Summary
The goal of this change is to migrate our chef server from DO to GCP.
Related Issue: &138 (closed)
Note: For this change, we will be referring to two different nodes:
- chef.gitlab.com, or old chef server
- chef-01-inf-ops.c.gitlab-ops.internal, or new chef server
Change Details
- Services Impacted - chef.gitlab.com
- Change Technician - @cmcfarland
- Change Criticality - C2
- Change Type - changescheduled
- Change Reviewer - @T4cC0re
- Due Date - 2020-09-19 14:00 UTC
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - If there is a need for downtime, include downtime estimate here
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Take a backup (or screenshot) of the DO network rules for chef in case we need to re-create them. Attach the screenshot to this issue.
- Get MR for DNS change approved: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2068
- On chef-01-inf-ops.c.gitlab-ops.internal, run: sudo -i mkdir -p /usr/lib/systemd/system
- Make sure the chef-server, opscode-analytics, and opscode-push-jobs deb packages are installed on the new chef server:
    cd /tmp
    wget https://packages.chef.io/files/stable/chef-server/12.6.0/ubuntu/14.04/chef-server-core_12.6.0-1_amd64.deb
    wget https://packages.chef.io/files/stable/opscode-analytics/1.6.6/ubuntu/16.04/opscode-analytics_1.6.6-1_amd64.deb
    wget https://packages.chef.io/files/stable/opscode-push-jobs-server/2.1.1/ubuntu/16.04/opscode-push-jobs-server_2.1.1-1_amd64.deb
    sudo dpkg -i /tmp/chef-server-core_12.6.0-1_amd64.deb
    sudo dpkg -i /tmp/opscode-analytics_1.6.6-1_amd64.deb
    sudo dpkg -i /tmp/opscode-push-jobs-server_2.1.1-1_amd64.deb
- Install the certificates for nginx onto the new chef server chef-01-inf-ops.c.gitlab-ops.internal:
  - /etc/ssl/chef.gitlab.com.crt
  - /etc/ssl/chef.gitlab.com.key
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Merge: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2068 (BUT DO NOT APPLY)
- Actions on old chef node:
  - Delete inbound access (except SSH) from the DO chef server network rules (console)
    - At this point, we expect to see alerts showing that Chef clients are unable to converge. This is intended.
  - Verify that chef.gitlab.com is unreachable except by SSH: curl chef.gitlab.com
  - SSH into chef.gitlab.com
    - Remove the online backup config from the /etc/opscode/chef-server.rb server config.
    - Run sudo chef-server-ctl reconfigure to apply the change.
    - Create an offline backup of the chef config with sudo chef-server-ctl backup. This step stops the Chef components that need to be stopped and leaves running the ones required to perform the backup. When it finishes, it leaves the Chef server up.
    - Stop chef server on chef.gitlab.com with: sudo chef-server-ctl stop
    - Note the location and name of the backup in /var/opt/chef-backup.
  - Log out of chef.gitlab.com. This is to avoid confusion about which server is being worked on.
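Before logging out of the old server, the "note the location and name of the backup" step could be scripted roughly as follows. This is only a sketch: the helper name newest_backup is hypothetical, and it assumes the archive follows the chef-backup-*.tgz naming used elsewhere in this issue.

```shell
# Hypothetical helper: print the most recently modified chef-backup-*.tgz
# in the given directory. Returns non-zero if no archive is found.
newest_backup() {
  local dir="$1" newest="" f
  for f in "$dir"/chef-backup-*.tgz; do
    # Skip the literal pattern when the glob matches nothing
    [ -e "$f" ] || continue
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}
```

For example, sha256sum "$(newest_backup /var/opt/chef-backup)" would print a digest that can be pasted into this issue and compared again after the copy. Reading the directory may require sudo.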
- Actions on new chef node:
  - SSH into chef-01-inf-ops.c.gitlab-ops.internal with agent forwarding: ssh -A chef-01-inf-ops.c.gitlab-ops.internal
  - scp the backup from chef.gitlab.com to chef-01-inf-ops.c.gitlab-ops.internal:
      scp chef.gitlab.com:/var/opt/chef-backup/chef-backup-2020-XX-XX-XX-XX-XX.tgz /tmp
  - Make sure the chef server is not running on chef-01-inf-ops.c.gitlab-ops.internal: sudo chef-server-ctl stop
  - Restore the backup with a cleanse: sudo chef-server-ctl restore -c /tmp/chef-backup-2020-XX-XX-XX-XX-XX.tgz
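Since the restore runs with a cleanse, it may be worth confirming that the archive survived the scp intact before restoring it. The sketch below assumes a SHA-256 digest of the archive was recorded on the old server (for example with sha256sum) before logging out. The function name is hypothetical and not part of the original plan.

```shell
# Hypothetical helper: succeed only if FILE's SHA-256 digest equals EXPECTED.
# Usage: verify_backup_checksum FILE EXPECTED_HEX_DIGEST
verify_backup_checksum() {
  local file="$1" expected="$2" actual
  # sha256sum prints "<digest>  <file>"; keep only the digest
  actual="$(sha256sum "$file" | awk '{print $1}')"
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

Running it against the copied archive under /tmp with the recorded digest, and stopping if it fails, keeps a truncated copy from being restored.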
- Verify the new chef server is working properly:
  - Run sudo chef-server-ctl test on chef-01-inf-ops.c.gitlab-ops.internal to test the server. There should be only two errors, both expected and known.
  - On chef-01-inf-ops.c.gitlab-ops.internal, edit the /etc/chef/client.rb file so that the client points to itself (chef-01-inf-ops.c.gitlab-ops.internal) and make sure the ssl_verify_mode setting is :verify_none.
  - On chef-01-inf-ops.c.gitlab-ops.internal, open another terminal and tail this log to verify that incoming requests are reaching this server: sudo tail -f /var/log/opscode/nginx/access.log
  - Run sudo chef-client -Fmin -W to do a dry run against the new server.
  - Run sudo chef-client to converge for real from the new server.
- Upgrade Chef Server to 12.19.31:
  - Download the latest chef-server packages:
      cd /tmp
      wget https://packages.chef.io/files/stable/chef-server/12.19.31/ubuntu/16.04/chef-server-core_12.19.31-1_amd64.deb
      wget https://packages.chef.io/files/stable/opscode-push-jobs-server/2.2.8/ubuntu/16.04/opscode-push-jobs-server_2.2.8-1_amd64.deb
  - Install and upgrade the Chef Server:
      cd /tmp
      sudo dpkg -i /tmp/chef-server-core_12.19.31-1_amd64.deb
      sudo dpkg -i /tmp/opscode-push-jobs-server_2.2.8-1_amd64.deb
      sudo chef-server-ctl upgrade
      sudo chef-server-ctl start
      sudo chef-server-ctl cleanup
      sudo chef-server-ctl test
  - Run sudo chef-client -Fmin -W to do a dry run against the new server.
  - Run sudo chef-client to converge for real from the new server.
  - On chef-01-inf-ops.c.gitlab-ops.internal, edit the /etc/chef/client.rb file so that the client points to chef.gitlab.com and remove the ssl_verify_mode setting.
  - Take a snapshot in GCP of the new chef server.
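The two client.rb edits above (pointing the client at itself with :verify_none, then back at chef.gitlab.com without it) are easy to get wrong under time pressure. A minimal sketch of a check that could run before each chef-client invocation follows. The function names are hypothetical, and it assumes client.rb sets chef_server_url as a quoted http(s) URL on a single line.

```shell
# Hypothetical helper: print the host part of chef_server_url in a client.rb file.
client_rb_server_host() {
  local file="$1"
  sed -n -E "s|^[[:space:]]*chef_server_url[[:space:]]+[\"']https?://([^/:\"']+).*|\\1|p" "$file" | head -n 1
}

# Hypothetical helper: succeed only if the file sets ssl_verify_mode to :verify_none.
client_rb_verify_none() {
  grep -Eq '^[[:space:]]*ssl_verify_mode[[:space:]]+:verify_none' "$1"
}
```

During the self-pointing phase, client_rb_server_host /etc/chef/client.rb should print chef-01-inf-ops.c.gitlab-ops.internal and client_rb_verify_none should succeed. After the final edit it should print chef.gitlab.com and client_rb_verify_none should fail.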
- Update DNS to have all nodes use the new chef server:
  - Ensure https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2068 is merged and your checkout is up to date on master
  - Update DNS terraform config
    - cd into ~/workspace/gitlab-com-infrastructure/environments/dns
    - Make note of the output of: tf state show 'module.gitlab_com.module.a.cloudflare_record.a_aaaa["A_chef.gitlab.com._0"]'
      Copy the output to this issue. We will need it later.
      - EXAMPLE OUTPUT (DO NOT USE DURING CHANGE). Use the commented output!
          # module.gitlab_com.module.a.cloudflare_record.a_aaaa["A_chef.gitlab.com._0"]:
          resource "cloudflare_record" "a_aaaa" { [...] }
    - Create a backup of the state: tf state pull > tfstate.json.bak
      We do not rely on it, but it is good to have.
    - Run: tf state rm 'module.gitlab_com.module.a.cloudflare_record.a_aaaa["A_chef.gitlab.com._0"]'
      This does not remove the record itself, only Terraform's handle on it. The record is now temporarily unmanaged.
  - Update ops terraform config
    - cd into ~/workspace/gitlab-com-infrastructure/environments/ops
    - Pull the current terraform state via: tf state pull > tfstate.json
    - Create a backup: cp tfstate.json{,.bak}
    - Open tfstate.json in an editor
      - Increase the serial number at the top by 1.
      - Search for chef.gitlab.net (yes, .net, this was a placeholder)
      - You should find an object whose index_key is A_chef.gitlab.net_0. This is the object you want.
      - Update this index_key to A_chef.gitlab.com_0.
      - Set the following attributes to the values extracted from the DNS environment's record:
        - created_on
        - hostname
        - id
        - modified_on
        - name
        - zone_id
        - leave the rest alone
      - Save the changes.
    - Diff the modified file and the backup: diff -u3 tfstate.json{,.bak}
      - It should look similar to this (lines marked - come from the edited file, lines marked + from the backup):

            --- tfstate.json 2020-09-18 15:05:57.819934502 +0200
            +++ tfstate.json.bak 2020-09-18 15:01:11.913046269 +0200
            @@ -1,7 +1,7 @@
             {
               "version": 4,
               "terraform_version": "0.12.20",
            -  "serial": 794,
            +  "serial": 793,
               "lineage": "1d77d621-9012-4305-8aff-faaea726cac8",
               "outputs": {
                 "ops_ip": {
            @@ -3680,28 +3680,28 @@
                   "provider": "provider.cloudflare",
                   "instances": [
                     {
            -          "index_key": "A_chef.gitlab.com_0",
            +          "index_key": "A_chef.gitlab.net_0",
                       "schema_version": 1,
                       "attributes": {
            -            "created_on": "2020-02-19T17:36:11.447274Z",
            +            "created_on": "2020-03-16T14:55:37.172119Z",
                         "data": {},
            -            "hostname": "chef.gitlab.com",
            -            "id": "ff3c4011d31559f42183612a67e23f4e",
            +            "hostname": "chef.gitlab.net",
            +            "id": "bab4211f168d8f50b84e5e7ea2236595",
                         "metadata": {
                           "auto_added": "false",
                           "managed_by_apps": "false",
                           "managed_by_argo_tunnel": "false",
                           "source": "primary"
                         },
            -            "modified_on": "2020-02-19T17:36:11.447274Z",
            -            "name": "chef.gitlab.com",
            +            "modified_on": "2020-03-16T14:55:37.172119Z",
            +            "name": "chef.gitlab.net",
                         "priority": 0,
                         "proxiable": true,
                         "proxied": false,
                         "ttl": 300,
                         "type": "A",
                         "value": "35.224.62.114",
            -            "zone_id": "5b1bc06af128a829167e3a1212d86c28"
            +            "zone_id": "736d3823d5085e26501019cbc313b20e"
                       },
                       "private": "REDACTED",
                       "dependencies": [
    - Update the remote state via: tf state push tfstate.json
    - Run: tf plan -out plan
      - You should see an IP change for module.chef-lb.module.dns_record_external.module.a.cloudflare_record.a_aaaa["A_chef.gitlab.com_0"]
    - Apply the plan via: tf apply plan
  - chef.gitlab.com now points to the new node
  - chef.gitlab.net is unmanaged by Terraform and needs to be manually deleted
  - After this change, we expect to see alert recoveries as chef client converges start succeeding again.
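For the hand edit of tfstate.json described above, a quick sanity check before pushing is that the serial went up by exactly one relative to the backup. The sketch below is one way to do that with standard tools. The function names are hypothetical, and it assumes the state file is pretty-printed with "serial" on its own line, as in the example diff.

```shell
# Hypothetical helper: print the first "serial" value found in a Terraform state file.
state_serial() {
  awk -F'[:,]' '/^[ \t]*"serial"[ \t]*:/ { gsub(/[^0-9]/, "", $2); print $2; exit }' "$1"
}

# Hypothetical helper: succeed only if EDITED's serial is exactly BACKUP's serial + 1.
# Usage: serial_bumped_by_one EDITED BACKUP
serial_bumped_by_one() {
  local edited backup
  edited="$(state_serial "$1")"
  backup="$(state_serial "$2")"
  [ -n "$edited" ] && [ -n "$backup" ] && [ "$edited" -eq $((backup + 1)) ]
}
```

For example, serial_bumped_by_one tfstate.json tfstate.json.bak can gate the tf state push step. It complements the manual diff review and does not replace it.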
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Verify that this knife vault command works and that the SHA-512 output matches the result from the old chef server: knife vault show gitlab-sentry prd -Fjson | sha512sum
    a6edd8006a38c9215830bedfa3623d93ef5583ace8104817c4b0b8b8c5d9d8cb0ff3a4bbbded58e7da9b1198e4df64a3659debf95cfa7eedbac27da6718d6223
- SSH into a node like customers.gitlab.com and test chef (this is a node that lives in Azure and runs an old version of Chef Client).
  - Verify that DNS resolves to the new chef server's IP: nslookup chef.gitlab.com
  - Perform a chef dry run: sudo chef-client -Fmin -W
  - Converge Chef: sudo chef-client
- Copy the backup of the old chef server to the chef-migration-backup-2711 bucket in gitlab-ops: gsutil cp BACKUP_FILE gs://chef-migration-backup-2711/
- Power off or stop the old chef server in Digital Ocean.
- Create a snapshot of the old chef server in Digital Ocean.
- Clean up temporary files from the new chef server under /tmp:
    cd /tmp
    rm -r chef_backup*
- Go to the Cloudflare UI
- Identify the record chef.gitlab.net (.NET!!!) and delete it.
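The knife vault comparison in the first post-change step can be made less error-prone than comparing 128 hex characters by eye. A minimal sketch, assuming sha512sum and awk are available on the workstation (the function name is hypothetical):

```shell
# Hypothetical helper: hash stdin with SHA-512 and succeed only if the
# digest equals the expected hex string passed as the first argument.
stdin_sha512_matches() {
  local expected="$1" actual
  # sha512sum prints "<digest>  -" when reading stdin; keep only the digest
  actual="$(sha512sum | awk '{print $1}')"
  [ "$actual" = "$expected" ]
}
```

Usage would look like: knife vault show gitlab-sentry prd -Fjson | stdin_sha512_matches EXPECTED && echo match, where EXPECTED stands for the digest recorded from the old chef server in the step above.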
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
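- Restore the backed-up Terraform state from above.
  - cd into ~/workspace/gitlab-com-infrastructure/environments/ops
  - tf state push ./tfstate.json.bak
    If the modified state was already pushed, the remote serial is higher than the backup's, and terraform state push will refuse the older file unless -force is passed.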
- Re-create the port access in DO for the chef node: https://cloud.digitalocean.com/networking/firewalls/ba41375b-775f-44fc-b318-69c6ffd341ce/rules?i=52f790
  Use the screenshot from the pre-steps to make sure the correct access is added back.
- Start the Digital Ocean Droplet (or verify that it is running) and verify its IP address.
  - If the IP address is not 128.199.60.225, update the address in Cloudflare and create a new terraform MR to make this change permanent.
- Make sure the chef service is up and running with: sudo chef-server-ctl start
- Run sudo chef-client on chef.gitlab.com to make sure chef is working.
- Revert and review changes for TF and correct if needed: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2068
Monitoring
Key metrics to observe
- Metric: Operations / Chef Client
- Location: https://dashboards.gitlab.net/d/000000231/chef-client?orgId=1&refresh=1m
- What changes to this metric should prompt a rollback: We should see nodes not converging during the change, but recovering as the new server is updated in DNS.
- Metric: Thanos Query of Chef Errors
- Location: https://thanos-query.ops.gitlab.net/graph?g0.range_input=6h&g0.max_source_resolution=0s&g0.expr=chef_client_error%3E0&g0.tab=0
- What changes to this metric should prompt a rollback: This should show more errors as we migrate. Over the 30 minutes after the DNS change, it should go back down to zero (or close to it).
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.