FQDN/hostname changing intermittently during chef-client run, causing metrics exporter labels to change
Summary
We were paged twice recently by traffic cessation alerts from Gitaly storage nodes because those nodes had their FQDN changed for a short period of time after a chef-client run.
During that window, the metrics exporter emitted a label containing the hostname without the domain part. When the hostname reverted (on its own), the change in label value caused traffic cessation alerts.
@stanhu mentioned intermittent DNS resolution failures in the past, which may give us a clue about this issue.
@igorwwwwwwwwwwwwwwwwwwww suggested changing the log level for the DNS resolver on the servers to gather log evidence of whether this is related to DNS query failures.
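Alongside resolver logging, one way to capture evidence of the flip itself might be to sample the node's FQDN around a chef-client run and record every transition. The following is only a minimal sketch, not something from the affected nodes: the function names, the polling interval, and the use of `socket.getfqdn()` are all assumptions, and the issue does not state how the exporter actually derives its hostname label.

```python
import socket
import time
from datetime import datetime, timezone


def is_fully_qualified(name: str) -> bool:
    """Return True if the name still has a domain part (contains a dot)."""
    return "." in name.strip(".")


def find_changes(samples):
    """Given (timestamp, name) samples, return (timestamp, old, new) transitions."""
    changes = []
    previous = None
    for ts, name in samples:
        if previous is not None and name != previous:
            changes.append((ts, previous, name))
        previous = name
    return changes


def watch(interval_seconds: float = 5.0, iterations: int = 12):
    """Poll socket.getfqdn(), then print every change seen between samples.

    Hypothetical helper: interval and iteration count are arbitrary defaults.
    """
    samples = []
    for _ in range(iterations):
        now = datetime.now(timezone.utc).isoformat()
        samples.append((now, socket.getfqdn()))
        time.sleep(interval_seconds)
    for ts, old, new in find_changes(samples):
        note = "" if is_fully_qualified(new) else " (domain part missing)"
        print(f"{ts}: {old} -> {new}{note}")
    return samples
```

Calling `watch()` on a node during a chef-client run would print one line per observed change, which could then be correlated with the DNS resolver logs.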
Related Incident(s)
- 2022-10-11: gprd-clean-cache failing due to hos... (production#7863 - closed)
- 2022-10-19: Intermittent metrics reported by fi... (production#7898 - closed)
- 2023-03-16: GitalyServiceGoserverTrafficAbsentS... (production#8552 - closed)
- 2023-12-14: missing traffic on single Gitaly VM (production#17279 - closed)
- 2023-09-15: GitalyServiceGoserverTrafficAbsentS... (production#16378 - closed)
- 2023-12-30: Metrics labeling inconsistency on g... (production#17335 - closed)
Edited by Steve Xuereb