
Restart Consul agents to apply TLS configuration changes

Production Change - Criticality 1 (C1)

Change Objective: Mitigate the risk that Consul agents lose existing TLS connections to the Consul server and are unable to reconnect and verify the validity of the certificate, which expired on 03 August 2019. A detailed look at the specific failure scenarios is available here: #1037 (comment 201745119).
Change Type: Host reconfiguration and service restart
Services Impacted: All services. Every VM in the production environment that is running the consul agent; since the agent is installed by default, this effectively means all VMs.
Change Team Members: @ansdval @devin @msmiley @cmiskell @cindy @nnelson
Change Severity: C1
Buddy check
Tested in staging
Schedule of the change: 2019-09-09 02:00 UTC
Duration of the change: 1m (including a possible rollback)
Downtime Component: Yes, 1-3m
Detailed steps for the change: See below.

Prerequisites

What do we require for an initial state of the environment?

  1. All Patroni nodes must be in a healthy state: at a minimum, a majority must have working interaction with their consul agents, with replication lag on the replicas < ~n GiB.

  2. Generate a full list of servers running the consul agent.

  3. Verify all servers are running with UTC time.

  4. Verify the at command exists on all servers.

  5. Ensure all hosts are using the same NTP servers and not exceeding ~20 seconds of drift.

  6. Test how sudo works when invoked via the at command (a combined check for steps 3-6 is sketched after this list).

  7. Write the script that will update the consul.json files on all hosts running the local agent, as well as on all consul servers. See the execution details below for the desired configuration changes; a hedged sketch of such a script also follows this list.

  8. Disable sshguard on consul servers.
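
A minimal sketch of the checks in steps 3-6, batched with mussh in the same style as the execution steps. The exact probes (date +%Z, ntpq, the /tmp marker file) are assumptions about the hosts' tooling; substitute whatever is actually available.

     mussh -m 20 -b -H host_list.gprd.all_consul_agents_and_servers.txt -c '
       hostname
       date +%Z                                  # step 3: should print UTC
       command -v at || echo "MISSING: at"       # step 4: at must be installed
       ntpq -pn | head -n 5                      # step 5: same NTP peers, drift well under ~20s
       echo "sudo -n true && touch /tmp/at-sudo-ok" | at now + 1 minute   # step 6: sudo from an at job
     '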
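
The actual stop_chef_client_and_alter_consul_config.sh is not reproduced in this issue. The following is only a hedged sketch of what such a script could look like, assuming jq is present, chef-client runs as a systemd service, and /etc/consul/consul.json is the only file that needs the verify_* flags flipped.

     #!/usr/bin/env bash
     # Hypothetical sketch of stop_chef_client_and_alter_consul_config.sh -- not the real script.
     set -euo pipefail

     # Stop chef-client so it cannot revert the hand-edited config mid-change.
     sudo systemctl stop chef-client || true

     CONF=/etc/consul/consul.json
     sudo cp "$CONF" "$CONF.bak.$(date -u +%Y%m%dT%H%M%SZ)"

     # All agents stop verifying the (expired) certificate on outgoing connections;
     # consul servers ("server": true in the config) additionally get verify_incoming=false.
     if sudo jq -e '.server == true' "$CONF" >/dev/null 2>&1; then
       sudo jq '.verify_outgoing = false | .verify_incoming = false' "$CONF" | sudo tee "$CONF.new" >/dev/null
     else
       sudo jq '.verify_outgoing = false' "$CONF" | sudo tee "$CONF.new" >/dev/null
     fi
     sudo mv "$CONF.new" "$CONF"
     # No restart here: the simultaneous restart is scheduled separately via at in execution step 7.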

Execution Steps

  1. Verify that sshguard is disabled on consul servers.

  2. Run gitlab-patronictl pause to prevent failover of the master.

  3. Run the script to rewrite verify_outgoing to false in /etc/consul/consul.json on all hosts where consul is installed, and additionally verify_incoming to false on the consul servers: mussh -m 10 -b -H host_list.gprd.all_consul_agents_and_servers.txt -C stop_chef_client_and_alter_consul_config.sh

  4. Manually verify that verify_incoming and verify_outgoing are set to false in /etc/consul/consul.json on the consul servers and on a random sample of clients (a spot check is sketched after this list).

  5. Verify the chef client is disabled on ALL servers (especially pgbouncer, patroni and consul).

  6. Remove pgbouncer-01 from behind the Google ILB.

  7. Run mussh -m 20 -b -H host_list.gprd.all_consul_agents_and_servers.txt -c "echo 'sudo systemctl restart consul.service' |at HH:MM"
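
For the manual verification in step 4, a quick spot check could look like the following, assuming jq is installed on the sampled hosts. Both keys should come back false on the consul servers; on clients, verify_incoming may simply be absent (null), but verify_outgoing must be false.

     sudo jq '{verify_incoming, verify_outgoing}' /etc/consul/consul.json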

Verification Steps

  1. Run a watch of consul operator raft list-peers on the consul servers and Patroni hosts to verify there is a leader; if this fails, the error message will likely look like this: Unexpected response code: 500. (A watch invocation is sketched after this list.)

  2. Run gitlab-patronictl list to determine whether the master failed over. If it did, we need to immediately update the PGBouncer config to communicate with the new master, preferably using consul DNS.

  3. If connectivity to the Patroni master is failing, issue echo 'PAUSE;' | sudo /usr/local/bin/pgb-console to temporarily hold requests to the database.

  4. If PGBouncer connections were paused, verify Patroni has an elected master and that consul DNS is resolving on the PGBouncer servers before issuing the command echo 'RESUME;' | sudo /usr/local/bin/pgb-console to resume sending requests to the database.
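
For the watch in verification step 1, something along these lines works on each consul server and Patroni host; a healthy cluster shows exactly one peer in the leader state.

     watch -n 5 'consul operator raft list-peers'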

Cleanup Steps

  1. Run gitlab-patronictl resume to allow for failover.

  2. Run the pipeline for our Break Glass MR. This will push the chef changes to match the false settings for the verify lines in the config files.

  3. Verify a run of the chef-client on a low-impact node to ensure that the consul.json file is written correctly by the cookbook (a possible check is sketched after this list).

  4. Enable the chef-client on patroni, pgbouncer, and consul servers.

  5. Run mussh -m 20 -b -H host_list.gprd.all_consul_agents_and_servers.txt -c "hostname; sudo chef-client"

  6. Put pgbouncer-01 back behind the Google ILB and run sudo systemctl start pgbouncer-leader-check.
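
For cleanup step 3, one possible check on the chosen low-impact node (reusing the same jq spot check as during execution; jq assumed to be installed):

     sudo chef-client
     sudo jq '{verify_incoming, verify_outgoing}' /etc/consul/consul.json   # should still match the hand-applied values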

After this change, the consul clients will still use TLS encryption, but they will no longer verify the server certificate, so its expiry no longer matters.
