FQDN/hostname changes intermittently during chef-client runs, causing metrics exporter labels to change
Summary
We were paged twice recently by traffic cessation alerts from Gitaly storage nodes because the FQDN on those nodes changed for a short period after a chef-client run.
The metrics exporter was emitting a label containing the hostname without its domain part, and when the hostname reverted (on its own), the label change caused traffic cessation alerts.
@stanhu mentioned intermittent DNS resolution failures in the past, which may give us a clue about this issue.
@igorwwwwwwwwwwwwwwwwwwww suggested changing the log level of the DNS resolver on the servers so that we get log evidence of whether this is related to DNS query failures.
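To illustrate the suspected mechanism, here is a minimal Python sketch of how a label value could flip between the FQDN and the short hostname when name resolution fails. This is an assumption for illustration only: it is not the exporter's or Chef's actual code, and `fqdn_label`, `label_changed`, the host `gitaly-01`, and the domain `example.internal` are all hypothetical names.

```python
import socket
from typing import Callable


def fqdn_label(
    short_name: str,
    resolve: Callable[[str], str] = socket.getfqdn,
) -> str:
    """Return the value a hypothetical exporter would use for its fqdn label.

    Falls back to the short hostname when resolution raises an error or
    returns a name without a domain part.
    """
    try:
        name = resolve(short_name)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return short_name
    return name if "." in name else short_name


def label_changed(previous: str, current: str) -> bool:
    """True when the label value differs between two runs."""
    return previous != current


def failing_resolver(name: str) -> str:
    """Hypothetical resolver that simulates a transient DNS failure."""
    raise socket.gaierror("temporary failure in name resolution")


# Hypothetical host and domain, used only to show the flip.
healthy = fqdn_label("gitaly-01", lambda n: n + ".example.internal")
degraded = fqdn_label("gitaly-01", failing_resolver)
```

If the real cause is similar, `healthy` and `degraded` would be reported as two different label values for the same node, which could look like traffic stopping on one of them. Resolver logs from the affected window would help confirm or rule this out.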
Related Incident(s)
- 2022-10-11: gprd-clean-cache failing due to hos... (production#7863 - closed)
- 2022-10-19: Intermittent metrics reported by fi... (production#7898 - closed)
- 2023-03-16: GitalyServiceGoserverTrafficAbsentS... (production#8552 - closed)
- 2023-09-15: GitalyServiceGoserverTrafficAbsentS... (production#16378 - closed)
- 2023-12-14: missing traffic on single Gitaly VM (production#17279 - closed)
- 2023-12-30: Metrics labeling inconsistency on g... (production#17335 - closed)
- 2024-06-01: The goserver SLI of the gitaly serv... (production#18094 - closed)