Container Registry DB load balancing DNS lookup failures

Context

This was discovered while investigating https://app.incident.io/gitlab/incidents/2480.

Problem

We're seeing a small but steady stream of errors when resolving DNS records for database load balancing on the container registry. The feature is documented in detail here.

The registry starts by performing an SRV lookup against the configured replica record (https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/blob/2bd12f9e9118e8f02c34e42136106d71bc9be05b/releases/gitlab/values/gprd.yaml.gotmpl#L116), and then performs a host lookup for each server returned.

We're seeing mostly the following errors:

  1. failed to resolve replica hosts: error resolving DNS SRV record: lookup replica.patroni-registry.service.consul. on 10.221.4.10:53: dial tcp: lookup consul-gl-consul-dns.consul.svc.cluster.local: i/o timeout
  2. error resolving host "patroni-registry-v16-03-db-gprd.node.east-us-2.consul." address: lookup patroni-registry-v16-03-db-gprd.node.east-us-2.consul. on 10.67.0.10:53: no such host

The list of errors can be found here: https://log.gprd.gitlab.net/app/r/s/0ILyE

Ask

We need help from Infra to determine why SRV lookups intermittently fail with i/o timeout errors, followed by no such host errors on the host lookups. All of these lookups are performed against Consul.

Edited by João Pereira