Address resolution of SRV records for database load balancing should use the "Additional Section" A records supplied with the SRV record
See https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7656#note_208682133
See https://gitlab.com/gitlab-org/gitlab-ee/issues/13732#note_208700843
In https://gitlab.com/gitlab-org/gitlab-ee/issues/13732, we added support for SRV records for Database Load Balancing.

Currently, this uses a two-step resolution:

- The `SRV` record is resolved to a hostname and port using the Consul DNS interface (port 8600)
- The hostname is resolved using the system resolver
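
The two steps above can be sketched roughly as follows. This is only an illustration using Ruby's standard `resolv` library, not the code from the linked issue; the method names `two_step_resolve` and `consul_srv_lookup` are hypothetical, and both lookups are injectable so the flow can be tried without a running Consul agent.

```ruby
require 'resolv'

# Step 1 (assumed default): ask Consul's DNS interface on port 8600 for SRV records.
def consul_srv_lookup(service)
  Resolv::DNS.open(nameserver_port: [['127.0.0.1', 8600]]) do |dns|
    dns.getresources(service, Resolv::DNS::Resource::IN::SRV)
  end
end

# Resolve a service name to host, address and port in two separate lookups.
# srv_lookup returns objects responding to #target and #port;
# host_lookup turns a hostname into an IP address string.
def two_step_resolve(service,
                     srv_lookup: method(:consul_srv_lookup),
                     host_lookup: Resolv.method(:getaddress))
  srv_lookup.call(service).map do |srv|
    host = srv.target.to_s
    # Step 2: this lookup goes to the system resolver by default, which is
    # a different source from the one that answered the SRV query.
    { host: host, address: host_lookup.call(host), port: srv.port }
  end
end
```

The second lookup raises `Resolv::ResolvError` when the system resolver does not know the hostname returned in step 1.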
Unfortunately, @ahmadsherif and @mwasilewski-gitlab found, when testing on GitLab.com, that the hostnames provided by Consul do not necessarily resolve using the system resolver.
We discussed using Consul as the system resolver or "routing" the queries to Consul, but these approaches are fragile and lead to more potential breaking points in the application without offering many advantages.
However, it was pointed out that the `SRV` record response does in fact include the `A` record (and target IP address) in the original response, so we should use this.

```
$ dig @127.0.0.1 -p 8600 replica.patroni.service.consul. SRV
; <<>> DiG 9.10.3-P4-Ubuntu <<>> @127.0.0.1 -p 8600 replica.patroni.service.consul. SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54730
;; flags: qr aa rd; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 17
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;replica.patroni.service.consul. IN SRV
;; ANSWER SECTION:
replica.patroni.service.consul. 0 IN SRV 1 1 0 patroni-02-db-gprd.node.east-us-2.consul.
replica.patroni.service.consul. 0 IN SRV 1 1 0 patroni-08-db-gprd.node.east-us-2.consul.
replica.patroni.service.consul. 0 IN SRV 1 1 0 patroni-11-db-gprd.node.east-us-2.consul.
replica.patroni.service.consul. 0 IN SRV 1 1 0 patroni-09-db-gprd.node.east-us-2.consul.
replica.patroni.service.consul. 0 IN SRV 1 1 0 patroni-03-db-gprd.node.east-us-2.consul.
replica.patroni.service.consul. 0 IN SRV 1 1 0 patroni-05-db-gprd.node.east-us-2.consul.
replica.patroni.service.consul. 0 IN SRV 1 1 0 patroni-06-db-gprd.node.east-us-2.consul.
replica.patroni.service.consul. 0 IN SRV 1 1 0 patroni-10-db-gprd.node.east-us-2.consul.
;; ADDITIONAL SECTION:
patroni-02-db-gprd.node.east-us-2.consul. 0 IN A 10.220.16.102
patroni-02-db-gprd.node.east-us-2.consul. 0 IN TXT "consul-network-segment="
patroni-08-db-gprd.node.east-us-2.consul. 0 IN A 10.220.16.108
patroni-08-db-gprd.node.east-us-2.consul. 0 IN TXT "consul-network-segment="
patroni-11-db-gprd.node.east-us-2.consul. 0 IN A 10.220.16.111
patroni-11-db-gprd.node.east-us-2.consul. 0 IN TXT "consul-network-segment="
patroni-09-db-gprd.node.east-us-2.consul. 0 IN A 10.220.16.109
patroni-09-db-gprd.node.east-us-2.consul. 0 IN TXT "consul-network-segment="
patroni-03-db-gprd.node.east-us-2.consul. 0 IN A 10.220.16.103
patroni-03-db-gprd.node.east-us-2.consul. 0 IN TXT "consul-network-segment="
patroni-05-db-gprd.node.east-us-2.consul. 0 IN A 10.220.16.105
patroni-05-db-gprd.node.east-us-2.consul. 0 IN TXT "consul-network-segment="
patroni-06-db-gprd.node.east-us-2.consul. 0 IN A 10.220.16.106
patroni-06-db-gprd.node.east-us-2.consul. 0 IN TXT "consul-network-segment="
patroni-10-db-gprd.node.east-us-2.consul. 0 IN A 10.220.16.110
patroni-10-db-gprd.node.east-us-2.consul. 0 IN TXT "consul-network-segment="
;; Query time: 3 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Tue Aug 27 15:06:30 UTC 2019
;; MSG SIZE rcvd: 955
```

This can be seen above in the **;; ADDITIONAL SECTION:** part of the response.
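The following snippet shows how to read the address from Ruby:

```ruby
require 'net/dns'

# Query Consul's DNS interface directly, over TCP
resolver = Net::DNS::Resolver.new(nameservers: '127.0.0.1', port: 8600, use_tcp: true)
response = resolver.search('pgbouncer.service.consul.', Net::DNS::SRV)
# Print the address of the first A record in the additional section
puts response.additional.find { |r| r.is_a? Net::DNS::RR::A }.address
```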
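
When the response lists several replicas, each `SRV` target needs to be paired with the `A` record of the same name rather than taking the first `A` record found. A minimal sketch of that pairing is below. `SrvRecord`, `ARecord`, `TxtRecord` and `addresses_from_srv` are hypothetical stand-ins used for illustration, not classes from any DNS library, and the sample data is copied from the `dig` output in this issue.

```ruby
# Hypothetical stand-ins for parsed DNS records; a real DNS library
# would supply its own record classes.
SrvRecord = Struct.new(:host, :port)
ARecord   = Struct.new(:name, :address)
TxtRecord = Struct.new(:name, :text)

# Pair each SRV target with the A record of the same name from the
# additional section. Targets without a matching A record are skipped.
def addresses_from_srv(answers, additional)
  # Ignore TXT (and any other) records; index A records by hostname.
  ips = additional.grep(ARecord).to_h { |a| [a.name, a.address] }

  answers.filter_map do |srv|
    ip = ips[srv.host]
    { address: ip, port: srv.port } if ip
  end
end

answers = [
  SrvRecord.new('patroni-02-db-gprd.node.east-us-2.consul.', 0),
  SrvRecord.new('patroni-08-db-gprd.node.east-us-2.consul.', 0)
]
additional = [
  ARecord.new('patroni-02-db-gprd.node.east-us-2.consul.', '10.220.16.102'),
  TxtRecord.new('patroni-02-db-gprd.node.east-us-2.consul.', 'consul-network-segment='),
  ARecord.new('patroni-08-db-gprd.node.east-us-2.consul.', '10.220.16.108'),
  TxtRecord.new('patroni-08-db-gprd.node.east-us-2.consul.', 'consul-network-segment=')
]

p addresses_from_srv(answers, additional)
```

Skipping targets that have no matching `A` record is a design choice made for this sketch; falling back to the system resolver for those targets would be another option.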
We should amend https://gitlab.com/gitlab-org/gitlab-ee/issues/13732 to use the supplied additional `A` record. This has several advantages:

- Still only 1 SRV record query to DNS
- Service record and IP resolution occur from the same source, so this is much less fragile than the other approaches discussed (these included DNSMasq and using Consul as the system resolver)