Prometheus sometimes doesn't refresh its Alertmanager IPs when they cycle

Full details are in production#5532 (closed), which is a repeat of production#5326 (closed) and production#5441 (closed). In short, Prometheus (at least version 2.27.0) sometimes fails to refresh the Alertmanager IP addresses it obtains via dns_sd_config. It still appears to issue the DNS queries (circumstantial evidence from tcpdump), but it keeps using the old entries.

So when Alertmanager cycles all of its pods (deploys/updates, or perhaps node upgrades), some of our Prometheus deployments can end up trying to talk to IPs that no longer exist, even though the DNS queries they make return the new, correct IP addresses. However, it doesn't happen to all Prometheus instances: in the three incidents above, the affected instances were consistently prometheus-0{1,2}-inf-gprd (not the db or app instances) and the org-ci GKE installation. This may be a clue.
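One way to confirm the symptom described above is to compare the Alertmanager addresses Prometheus is actually using with what DNS currently returns. The following is a minimal sketch, not a vetted tool. It assumes Prometheus exposes its active Alertmanagers at the /api/v1/alertmanagers HTTP API endpoint and that the discovered targets are plain IP addresses (as with A-record discovery). The Prometheus URL and DNS name in the usage comment are hypothetical placeholders.

```python
import json
import socket
from urllib.parse import urlparse
from urllib.request import urlopen


def resolve_ips(hostname):
    """Return the set of IP addresses DNS currently gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def hosts_from_payload(payload):
    """Extract the host part of each active Alertmanager URL
    from an /api/v1/alertmanagers response body."""
    active = payload["data"]["activeAlertmanagers"]
    return {urlparse(entry["url"]).hostname for entry in active}


def active_alertmanager_hosts(prometheus_url):
    """Ask Prometheus which Alertmanager hosts it is currently sending to."""
    with urlopen(f"{prometheus_url}/api/v1/alertmanagers", timeout=10) as resp:
        return hosts_from_payload(json.load(resp))


def stale_hosts(in_use, resolved):
    """Hosts Prometheus still targets that DNS no longer returns."""
    return sorted(in_use - resolved)


# Usage (hypothetical URL and DNS name, substitute real values):
#   in_use = active_alertmanager_hosts("http://localhost:9090")
#   resolved = resolve_ips("alertmanager.example.internal")
#   print(stale_hosts(in_use, resolved))
```

A non-empty result right after an Alertmanager rollout would be consistent with the stale-entry behaviour, although a short lag of up to one refresh interval is expected even on a healthy instance.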

Past incidents

production#7120 (closed)

Potential solution

Use an SRV record to discover ClusterIP endpoints instead of Pod IPs.

Create an SRV record in Terraform pointing to the internal DNS endpoint of each Alertmanager Service (3 total).
Update the Alertmanager discovery method in Prometheus to use the SRV DNS record: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dns_sd_config
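As a rough illustration of the second step, the sketch below builds the shape an alerting section with SRV-based discovery might take. It is written as a Python dict and rendered as JSON so the structure can be checked programmatically. The keys (alerting, alertmanagers, dns_sd_configs, names, type, refresh_interval) follow the Prometheus configuration documentation linked above. The SRV record name is a hypothetical placeholder, and whether a given config pipeline accepts JSON in place of YAML is an assumption that has not been tested here.

```python
import json


def alerting_config(srv_names, refresh_interval="30s"):
    """Build a Prometheus `alerting` section that discovers
    Alertmanager through DNS SRV records."""
    return {
        "alerting": {
            "alertmanagers": [
                {
                    "dns_sd_configs": [
                        {
                            "names": list(srv_names),
                            # With SRV, host and port come from the record,
                            # so no separate `port` setting is needed.
                            "type": "SRV",
                            "refresh_interval": refresh_interval,
                        }
                    ]
                }
            ]
        }
    }


def render(config):
    """Serialize the config as indented JSON."""
    return json.dumps(config, indent=2)


# Usage (hypothetical record name):
#   print(render(alerting_config(["_alertmanager._tcp.example.internal"])))
```

Because the SRV targets would be the stable Service endpoints rather than Pod IPs, a pod rollout should no longer change the addresses Prometheus needs to track.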

Edited May 24, 2022 by Steve Xuereb