Issues found during Consul cluster migration to k8s


Summary

During the work to migrate Consul from VMs to k8s, we ran into a couple of issues:

use_tcp: false (i.e. UDP) returned different results from TCP for DNS lookups

We switched the app to use UDP for DNS lookups because it has lower overhead, and we assumed this would not affect the application. However, it was reported that over UDP the app only ever got 3 records back when querying endpoints like db-replica.service.consul.
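To reproduce the difference outside the app, the same name can be queried over both transports and the answer counts compared. The following is a minimal sketch using only the Python standard library. The server address `127.0.0.1` and port `8600` (Consul's default DNS port) are assumptions about the local agent, and the helper names are illustrative. One possible explanation worth checking, not confirmed in this issue, is Consul's `udp_answer_limit` DNS setting, whose documented default is 3.

```python
import socket
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Encode a minimal DNS query (RFC 1035) for `name`; qtype 1 is an A record."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN
    return header + question


def parse_header(response: bytes) -> tuple:
    """Return (answer_count, truncated) from the 12-byte DNS response header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return ancount, bool(flags & 0x0200)  # 0x0200 is the TC (truncated) bit


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before full DNS message arrived")
        buf += chunk
    return buf


def query_udp(server: str, port: int, name: str, timeout: float = 2.0) -> tuple:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, port))
        data, _ = sock.recvfrom(4096)
    return parse_header(data)


def query_tcp(server: str, port: int, name: str, timeout: float = 2.0) -> tuple:
    query = build_query(name)
    with socket.create_connection((server, port), timeout=timeout) as sock:
        # DNS over TCP prefixes each message with a 2-byte length.
        sock.sendall(struct.pack("!H", len(query)) + query)
        length = struct.unpack("!H", _recv_exact(sock, 2))[0]
        data = _recv_exact(sock, length)
    return parse_header(data)


if __name__ == "__main__":
    # Assumed local agent address; adjust for the environment under test.
    name = "db-replica.service.consul"
    print("udp (answers, truncated):", query_udp("127.0.0.1", 8600, name))
    print("tcp (answers, truncated):", query_tcp("127.0.0.1", 8600, name))
```

If the UDP answer count is capped while the TCP count matches the number of healthy replicas, the difference comes from the DNS responder rather than from the app's resolver.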

Impact on users when restarting Consul agents in Patroni clusters & PGBouncers

We run Consul agents as a DaemonSet in k8s (provided by the consul-k8s chart), and we also deploy a Consul agent on each VM. Patroni is configured to use Consul as its Distributed Configuration Store, and every Patroni instance registers a service whose health check determines whether the current node is the leader. If it is, an endpoint like master.patroni.service.consul resolves to that host. PGBouncer uses this endpoint to reach the leader for read/write queries.
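For illustration, a leader health check of this kind could look like the sketch below. It is not the check we actually run. It assumes Patroni's REST API is reachable on `127.0.0.1:8008` and that its `/primary` endpoint returns HTTP 200 only on the leader, and it relies on Consul's script-check convention where exit code 0 is passing and any code other than 0 or 1 is critical. The function names are hypothetical.

```python
import sys
import urllib.error
import urllib.request

PASSING, CRITICAL = 0, 2  # Consul script checks: 0 = passing, 1 = warning, other = critical


def http_status(url: str, timeout: float = 2.0):
    """Return the HTTP status code for `url`, or None if the request could not complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (e.g. 503 on a replica) arrive as HTTPError.
        return err.code
    except OSError:
        # Covers URLError, connection refused and timeouts.
        return None


def leader_check_exit_code(status) -> int:
    """Map the Patroni response to a Consul check result: only 200 means leader."""
    return PASSING if status == 200 else CRITICAL


if __name__ == "__main__":
    # Assumed Patroni REST address; adjust to the node's actual listen address.
    sys.exit(leader_check_exit_code(http_status("http://127.0.0.1:8008/primary")))
```

With a check like this, only the node whose check passes is returned for the service name, which is why the endpoint follows the leader.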

Upgrading Consul requires restarting the Consul process, which may affect the app in ways we have not yet characterized:

  1. When Consul is restarted on a Patroni node, the read/write endpoint (e.g. master.patroni.service.consul) becomes unavailable for 15-20 seconds. Presumably this wouldn't affect existing connections, but would any new connections be affected? What would be the impact on the app if queries via PGBouncers failed for this period?

  2. When Consul is restarted on the PGBouncers, any DNS queries would fail for the duration of the restart (probably shorter than 15-20 seconds). What is the impact? One idea is to configure dnsmasq to talk to Consul on localhost, but fall back to the Consul server cluster.
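The fallback idea in item 2 comes down to trying resolvers in a fixed order and moving on when the local agent is unreachable. The sketch below illustrates only that ordering logic. It is not a dnsmasq configuration, and the resolver callables and addresses are hypothetical stand-ins for "Consul on localhost" and "the Consul server cluster".

```python
from typing import Callable, Iterable, List

# A resolver takes a DNS name and returns a list of addresses, or raises OSError.
Resolver = Callable[[str], List[str]]


def resolve_with_fallback(name: str, resolvers: Iterable[Resolver]) -> List[str]:
    """Return the first non-empty answer, trying resolvers in the order given."""
    last_error = None
    for resolve in resolvers:
        try:
            answers = resolve(name)
        except OSError as err:
            # Local agent restarting: connection refused or timed out.
            last_error = err
            continue
        if answers:
            return answers
    raise LookupError(f"no resolver answered for {name}") from last_error


if __name__ == "__main__":
    def local_agent(name: str) -> List[str]:
        # Simulates the local Consul agent being down during a restart.
        raise ConnectionRefusedError("local agent is restarting")

    def server_cluster(name: str) -> List[str]:
        # Hypothetical answer from the Consul server cluster.
        return ["10.0.0.5"]

    print(resolve_with_fallback("master.patroni.service.consul",
                                [local_agent, server_cluster]))
```

The trade-off is that queries answered by the fallback add load on the server cluster for the duration of the agent restart, so it is worth measuring how many lookups the PGBouncers issue in that window.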

Impact

We need to get into the habit of upgrading Consul agents regularly to avoid falling behind on versions, so any work we can do to make these upgrades boring would be great.

Recommendation

Verification
