Restart Consul agents to apply TLS configurations
|Change Objective||Mitigate the risk that Consul agents lose existing TLS connections to the Consul server and are unable to reconnect and verify the validity of the certificate, which expired on 03 August 2019. A detailed look at the specific failure scenarios is available here: #1037 (comment 201745119).|
|Change Type||Host reconfiguration and service restart|
|Services Impacted||All services. Every VM in the production environment that is running the consul agent. As a default setting, this implies all VMs.|
|Change Team Members||@ansdval @devin @msmiley @cmiskell @cindy @nnelson|
|Tested in staging|
|Schedule of the change||2019-09-09 02:00 UTC|
|Duration of the change||1m (including a possible rollback)|
|Downtime Component||Yes, 1-3m|
|Detailed steps for the change. Each step must include:||See below.|
What do we require for an initial state of the environment?
All Patroni nodes must be in a healthy state: at minimum, a majority must have working interaction with their consul agents, with lag on the replicas < ~n GiB.
Generate a full list of servers running the consul agent.
Verify all servers are running with UTC time.
The `at` command exists on all servers.
Ensure all hosts are using the same NTP servers and not exceeding ~20 seconds of drift.
`sudo` works using the
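The local preconditions above (UTC time, `at` present, drift under ~20 seconds) could be checked with something like the following. This is a minimal sketch, not the team's actual tooling: the function names are made up for illustration, and how the clock offset is measured is left out because the runbook does not say which NTP client is in use.

```bash
#!/usr/bin/env bash
# Illustrative precondition helpers; none of these names come from the runbook.

# Succeeds when the absolute clock offset in seconds (may be negative or
# fractional) does not exceed the limit. The default of 20 is the ~20 second
# figure from the runbook.
drift_within_limit() {
  local offset="$1" limit="${2:-20}"
  awk -v o="$offset" -v l="$limit" 'BEGIN { if (o < 0) o = -o; exit !(o <= l) }'
}

# Succeeds when the given timezone abbreviation is UTC.
is_utc() {
  [ "$1" = "UTC" ]
}

# Runs the local checks on one host and prints one line per failed check.
# No output means both checks passed.
check_host_preconditions() {
  is_utc "$(date +%Z)" || echo "timezone is not UTC"
  command -v at >/dev/null 2>&1 || echo "at command not found"
}
```

On a host, `check_host_preconditions` prints nothing when both checks pass, and `drift_within_limit "$offset"` can be fed whatever offset value the local NTP client reports.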
Write the script that will update the `consul.json` files on all hosts running the local agent, as well as on all consul servers. See the execution details below for the desired configuration changes.
`sshguard` on consul servers.
`sshguard` is disabled on consul servers.
`gitlab-patronictl pause` to prevent failover of the master.
Run the script to rewrite `/etc/consul/consul.json` on all hosts where consul is installed, and also
`false` on consul servers:
mussh -m 10 -b -H host_list.gprd.all_consul_agents_and_servers.txt -C stop_chef_client_and_alter_consul_config.sh
`/etc/consul/consul.json` on the consul servers and a random sample of clients.
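For illustration, the config-rewriting part of such a script might look like the sketch below. It is not the contents of `stop_chef_client_and_alter_consul_config.sh`, which are not shown here. It assumes the "verify lines" are boolean keys whose names start with `verify_` and that they are currently set to `true`. The function names are hypothetical, and stopping the chef client is not covered.

```bash
#!/usr/bin/env bash
# Reads a Consul JSON config on stdin and writes it to stdout with every
# "verify_*": true changed to false.
# ASSUMPTION: the keys to change all start with verify_; the runbook does
# not list them explicitly.
disable_verify_flags() {
  sed -E 's/("verify_[a-z_]+"[[:space:]]*:[[:space:]]*)true/\1false/g'
}

# Saves a backup next to the file, then rewrites the file from the backup.
# Writing into the existing file keeps its owner and permissions.
rewrite_consul_config() {
  local file="${1:-/etc/consul/consul.json}"
  cp -p "$file" "$file.bak" &&
    disable_verify_flags <"$file.bak" >"$file"
}
```

A text substitution like this is brittle compared with a real JSON tool, so comparing the rewritten file against the `.bak` copy on a sample of hosts is still worthwhile.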
Verify the chef client is disabled on ALL servers (especially
`pgbouncer-01` from behind the Google ILB.
mussh -m 20 -b -H host_list.gprd.all_consul_agents_and_servers.txt -c "echo 'sudo systemctl restart consul.service' |at HH:MM"
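Because `at HH:MM` is interpreted in each host's local time, and a time that has already passed is scheduled for the next day, it may help to validate the time string before fanning the command out. The following is a small sketch with hypothetical helper names. It only builds the command text and executes nothing.

```bash
#!/usr/bin/env bash
# Succeeds when the argument is a zero-padded 24-hour HH:MM time.
valid_hhmm() {
  case "$1" in
    [01][0-9]:[0-5][0-9] | 2[0-3]:[0-5][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the command line that each host should run to queue the restart.
# Nothing is executed here; the output is meant to be passed to mussh -c.
build_restart_job() {
  local when="$1"
  valid_hhmm "$when" || { echo "invalid time: $when" >&2; return 1; }
  printf "echo 'sudo systemctl restart consul.service' | at %s\n" "$when"
}
```

For example, `build_restart_job 02:05` prints the same remote command used above with `HH:MM` filled in, while `build_restart_job 2:5` is rejected.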
Run a watch on the consul servers and Patroni hosts viewing `consul operator raft list-peers` to verify there is a leader. If this fails, the error message will likely look like this: `Unexpected response code: 500`.
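The leader check can also be scripted rather than read by eye. This is a minimal sketch, assuming the usual tabular output of `consul operator raft list-peers` with a header row and a State column. The function name is illustrative.

```bash
#!/usr/bin/env bash
# Reads `consul operator raft list-peers` output on stdin and succeeds when
# some row after the header has a field that is exactly "leader".
# Comparing whole fields avoids matching node names that merely contain
# the word.
raft_has_leader() {
  awk 'NR > 1 { for (i = 1; i <= NF; i++) if ($i == "leader") found = 1 }
       END { exit !found }'
}
```

A polling loop such as `while sleep 2; do consul operator raft list-peers | raft_has_leader || echo "no leader"; done` gives roughly the same view as `watch`, which cannot call a shell function directly.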
`gitlab-patronictl list` to determine whether the master failed over. If it did, we need to immediately update the PGBouncer config to communicate with the new master, preferably using consul DNS.
If connectivity to the Patroni master is failing, issue `echo 'PAUSE;' | sudo /usr/local/bin/pgb-console` to temporarily hold requests to the database.
If PGBouncer connections were paused, verify Patroni has an elected master and that consul DNS is resolving on the PGBouncer servers before issuing the command `echo 'RESUME;' | sudo /usr/local/bin/pgb-console` to resume sending requests to the database.
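The "verify before resuming" rule can be enforced with a small guard. The sketch below is hypothetical: `resume_if_ready` is not an existing tool, and the check commands passed to it (one for the elected master, one for consul DNS resolution) are left to the operator, since the runbook does not spell them out.

```bash
#!/usr/bin/env bash
# Runs every check command given as an argument and prints RESUME; only if
# all of them succeed. Each check is a shell command string whose exit
# status decides the outcome; its output is discarded.
resume_if_ready() {
  local check
  [ "$#" -gt 0 ] || { echo "no checks given" >&2; return 2; }
  for check in "$@"; do
    if ! sh -c "$check" >/dev/null 2>&1; then
      echo "check failed: $check" >&2
      return 1
    fi
  done
  echo 'RESUME;'
}
```

One way to use it is `cmd="$(resume_if_ready 'MASTER_CHECK' 'DNS_CHECK')" && echo "$cmd" | sudo /usr/local/bin/pgb-console`, where `MASTER_CHECK` and `DNS_CHECK` are placeholders for the real commands. The console is only invoked when both checks pass.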
`gitlab-patronictl resume` to allow for failover.
Run the pipeline for our Break Glass MR. This will push the chef changes to match the `false` settings for the verify lines in the config files.
Verify a run of the chef-client on a low-impact node to ensure that the `consul.json` file is written properly by the cookbook.
Enable the chef-client on
mussh -m 20 -b -H host_list.gprd.all_consul_agents_and_servers.txt -c "hostname; sudo chef-client"
`pgbouncer-01` back behind the Google ILB and run `sudo systemctl start pgbouncer-leader-check`.
After this change, the consul clients will still be using TLS encryption but will ignore the fact that our cert is now expired.