Prometheus sometimes doesn't refresh its Alertmanager IPs when they cycle
Full details are in production#5532 (closed), which is a repeat of production#5326 (closed) and production#5441 (closed). In short, Prometheus (at least 2.27.0) sometimes doesn't refresh the Alertmanager IP addresses it obtains via its dns_sd_config configuration. It does appear to be making the DNS queries (circumstantial evidence from tcpdump), but it sticks with the old entries.
So when Alertmanager cycles all its pods (deploys/updates, or perhaps node upgrades), some of our Prometheus deployments can end up trying to talk to non-existent IPs, even though the DNS responses they receive contain the new, correct IP addresses. It doesn't happen to all Prometheus instances, however: in all 3 incidents it was consistently prometheus-0{1,2}-inf-gprd (not the db or app instances) and the org-ci GKE installation. This may be a clue.
Past incidents
Potential solution
Use an SRV record for discovery of ClusterIP endpoints instead of Pod IPs.
- Create an SRV record in Terraform pointing to the internal DNS endpoint of each Alertmanager SVC (3 total).
- Update the Alertmanager discovery method in Prometheus to use the SRV DNS record: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dns_sd_config
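
As a rough illustration of the second step, the Prometheus side might look something like the fragment below. This is only a sketch under stated assumptions: the record name `_alertmanager._tcp.alertmanager.example.internal` is hypothetical and stands in for whatever SRV record ends up being created in Terraform, and the `scheme` and `refresh_interval` values are illustrative rather than taken from our current configuration.

```yaml
# Sketch only: the SRV record name below is a hypothetical placeholder,
# not the real record; substitute the one created in Terraform.
alerting:
  alertmanagers:
    - scheme: http                # assumption: plain HTTP to Alertmanager
      dns_sd_configs:
        - names:
            - _alertmanager._tcp.alertmanager.example.internal
          type: SRV               # the port is taken from the SRV record itself
          refresh_interval: 30s   # illustrative; 30s is the documented default
```

Because the SRV targets would be the Service endpoints rather than individual Pod IPs, the discovered addresses should stay stable when Alertmanager pods cycle. That would sidestep the stale-entry behaviour described above rather than fix it in Prometheus itself.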