Restart Consul agents to apply TLS configurations
C1
Production Change - Criticality 1

Change Objective | Mitigate the risk that Consul agents lose existing TLS connections to the Consul server and are unable to reconnect and verify the validity of the certificate, which expired on 03 August 2019. A detailed look at the specific failure scenarios is available here: #1037 (comment 201745119). |
---|---|
Change Type | Host reconfiguration and service restart |
Services Impacted | All services. Every VM in the production environment that is running the consul agent. As a default setting, this implies all VMs. |
Change Team Members | @ansdval @devin @msmiley @cmiskell @cindy @nnelson |
Change Severity | C1 |
Buddy check | |
Tested in staging | |
Schedule of the change | 2019-09-09 02:00 UTC |
Duration of the change | 1m (including a possible rollback) |
Downtime Component | Yes, 1-3m |
Detailed steps for the change. Each step must include: | See below. |
Prerequisites
What do we require for an initial state of the environment?
- All Patroni nodes must be in a healthy state: at minimum, a majority must have working interaction with their consul agents, with lag on the replicas < ~n GiB.
- Generate a full list of servers running the consul agent.
- Verify all servers are running with UTC time.
- Verify the `at` command exists on all servers.
- Ensure all hosts are using the same NTP servers and not exceeding ~20 seconds of drift.
- Test how `sudo` works using the `at` command.
- Write the script that will update the `consul.json` files on all hosts running the local agent, as well as all consul servers. See execution details below for desired configuration changes.
- Disable `sshguard` on consul servers.
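
The UTC and `at` prerequisites lend themselves to a small per-host check. The following is a minimal sketch, not part of the original plan: the function names `is_utc_zone`, `have_command`, and `preflight` are hypothetical, and comparing the zone abbreviation from `date +%Z` against `UTC` is an assumed way of confirming the host runs on UTC time.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; the names are illustrative and do not
# come from the change plan.

# Succeeds when the given zone abbreviation (e.g. from `date +%Z`) is UTC.
is_utc_zone() {
    [ "$1" = "UTC" ]
}

# Succeeds when the named command is available on PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Prints one line per failed check and returns non-zero if any check failed.
preflight() {
    status=0
    is_utc_zone "$(date +%Z)" || { echo "FAIL: timezone is not UTC"; status=1; }
    have_command at || { echo "FAIL: 'at' command not found"; status=1; }
    return "$status"
}
```

Saved as a script with a trailing `preflight` call, this could be distributed with `mussh -C` in the same way as the other scripts in this plan, so that hosts failing either check print a `FAIL` line.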
Execution Steps
- Verify that `sshguard` is disabled on consul servers.
- Run `gitlab-patronictl pause` to prevent failover of the master.
- Run the script to rewrite `verify_outgoing` to `false` in `/etc/consul/consul.json` on all hosts where consul is installed, and also `verify_incoming` to `false` on consul servers: `mussh -m 10 -b -H host_list.gprd.all_consul_agents_and_servers.txt -C stop_chef_client_and_alter_consul_config.sh`
- Manually verify that `verify_incoming` and `verify_outgoing` are set to `false` in `/etc/consul/consul.json` on the consul servers and a random sample of clients.
- Verify the chef client is disabled on ALL servers (especially `pgbouncer`, `patroni` and `consul`).
- Remove `pgbouncer-01` from behind the Google ILB.
- Run `mussh -m 20 -b -H host_list.gprd.all_consul_agents_and_servers.txt -c "echo 'sudo systemctl restart consul.service' | at HH:MM"`
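
The contents of `stop_chef_client_and_alter_consul_config.sh` are not shown in this plan. As an illustration of the config rewrite only, a minimal sketch might look like the following. The function name `set_flag_false` is hypothetical, the chef-client handling is omitted, and the `sed` substitution assumes each key and its value sit on the same line of `consul.json`.

```shell
#!/bin/sh
# Illustrative only: this is NOT the actual
# stop_chef_client_and_alter_consul_config.sh used in the change.

# set_flag_false FILE KEY
# Rewrites `"KEY": true` to `"KEY": false` in FILE and keeps the previous
# contents in FILE.bak so the edit can be rolled back.
set_flag_false() {
    file=$1
    key=$2
    cp -p "$file" "$file.bak" || return 1
    # Writing through a redirect keeps the original file's owner and mode.
    sed -E "s/(\"$key\"[[:space:]]*:[[:space:]]*)true/\1false/" "$file.bak" > "$file"
}
```

On an agent-only host this would be called once, for example `set_flag_false /etc/consul/consul.json verify_outgoing`; on a consul server it would be called a second time for `verify_incoming`. A JSON-aware tool would be more robust than a text substitution if the file layout varies between hosts.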
Verification Steps
- Run a watch on consul servers and Patroni hosts viewing `consul operator raft list-peers` to verify there is a leader. If this fails, the error message will likely look like this: `Unexpected response code: 500`.
- Run `gitlab-patronictl list` to determine whether the master failed over. If it did, we need to immediately update the PGBouncer config to communicate with the new master, preferably using consul DNS.
- If connectivity to Patroni master is failing, issue `echo 'PAUSE;' | sudo /usr/local/bin/pgb-console` to temporarily hold requests to the database.
- If PGBouncer connections were paused, verify Patroni has an elected master and that consul DNS is resolving on the PGBouncer servers before issuing the command `echo 'RESUME;' | sudo /usr/local/bin/pgb-console` to resume sending requests to the database.
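
The leader check in the first verification step can also be scripted rather than read by eye. This is a minimal sketch under two assumptions: the helper name `has_raft_leader` is hypothetical, and the output of `consul operator raft list-peers` is assumed to be a header row followed by one row per peer with a State column containing `leader` or `follower`.

```shell
#!/bin/sh
# has_raft_leader: reads `consul operator raft list-peers` output on stdin
# and succeeds if any row after the header has a field equal to "leader".
# Assumption: the first line is a header, so it is skipped. An error such as
# "Unexpected response code: 500" therefore never counts as a leader.
has_raft_leader() {
    awk 'NR > 1 { for (i = 1; i <= NF; i++) if ($i == "leader") found = 1 }
         END { exit (found ? 0 : 1) }'
}
```

A possible use is `consul operator raft list-peers 2>&1 | has_raft_leader || echo "no raft leader"`, run in a loop or under `watch` during the restart window.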
Cleanup Steps
- Run `gitlab-patronictl resume` to allow for failover.
- Run the pipeline for our Break Glass MR. This will push the chef changes to match the `false` settings for the verify lines in the config files.
- Verify a run of the chef-client on a low-impact node to ensure that the `consul.json` file is written properly by the cookbook.
- Enable the chef-client on `patroni`, `pgbouncer`, and `consul` servers.
- Run `mussh -m 20 -b -H host_list.gprd.all_consul_agents_and_servers.txt -c "hostname; sudo chef-client"`
- Put `pgbouncer-01` back behind the Google ILB and run `sudo systemctl start pgbouncer-leader-check`.
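
When checking the low-impact node after its chef-client run, the `verify_*` settings can be confirmed with a one-line test instead of opening the file. This is a sketch only: the function name `flag_is_false` is hypothetical, and it assumes each key and its value appear on one line of `consul.json`.

```shell
#!/bin/sh
# flag_is_false FILE KEY
# Succeeds only if FILE contains `"KEY": false` (whitespace around the
# colon is allowed). Fails if the key is missing or set to another value.
flag_is_false() {
    grep -Eq "\"$2\"[[:space:]]*:[[:space:]]*false" "$1"
}
```

For example, `flag_is_false /etc/consul/consul.json verify_outgoing || echo "verify_outgoing was not written as false"` could be run on the node before enabling the chef-client everywhere else.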
After this change, the consul clients will still be using TLS encryption but will ignore the fact that our cert is now expired.