Container Registry DB load balancing DNS lookup failures
Context
This was discovered while investigating https://app.incident.io/gitlab/incidents/2480.
Problem
We're seeing a relatively small but regular stream of errors when resolving DNS records for database load balancing on the container registry. The feature is documented in detail here.
The registry starts by performing an SRV lookup against the configured record for replicas (https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/blob/2bd12f9e9118e8f02c34e42136106d71bc9be05b/releases/gitlab/values/gprd.yaml.gotmpl#L116) and then, for each returned server, it performs a host lookup.
We're seeing mostly the following errors:
failed to resolve replica hosts: error resolving DNS SRV record: lookup replica.patroni-registry.service.consul. on 10.221.4.10:53: dial tcp: lookup consul-gl-consul-dns.consul.svc.cluster.local: i/o timeout
error resolving host "patroni-registry-v16-03-db-gprd.node.east-us-2.consul." address: lookup patroni-registry-v16-03-db-gprd.node.east-us-2.consul. on 10.67.0.10:53: no such host
The list of errors can be found here: https://log.gprd.gitlab.net/app/r/s/0ILyE
Ask
We need help from Infra to determine the reason for these apparently random i/o timeout errors for SRV lookups followed by no such host errors for host lookups, all performed against Consul.