Remove catch-all rule for Prometheus target down
We currently have a very noisy alerting rule that checks whether Prometheus targets are up.
It acts as a catch-all because it applies no specific filtering.
https://gitlab.com/gitlab-com/runbooks/-/blob/master/rules/target-is-down.yml
Looking over a week, many of the alerts triggered by this rule are consistently under the threshold, or have even flatlined completely.
This has resulted in a lot of excess alert noise that is simply being ignored.
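For illustration, a catch-all target-down rule typically looks like the following (a sketch of the general pattern, not the exact rule in the linked runbook; the names and thresholds are assumptions):

```yaml
groups:
  - name: target-is-down
    rules:
      - alert: TargetIsDown
        # Matches every scrape target in every job -- no service filter,
        # so stale static targets fire alongside real outages.
        expr: up == 0
        for: 10m
        labels:
          severity: s3
        annotations:
          description: "Target {{ $labels.instance }} in job {{ $labels.job }} is down."
```

Because the `up == 0` expression carries no job or service matcher, any target that lingers in the static config keeps alerting indefinitely.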
Generally speaking, an alert like this is fine where service discovery is in use - e.g. Kubernetes SD - since the targets are dynamically updated to reflect the real state of discovered endpoints.
What remains, and often leads to unaudited noise, is static targets that have not been cleaned up.
Ideally we should remove such a catch-all rule; where target up-checking is important, it should be scoped specifically to the service.
This ties the alerting rule to a service, making it much easier to audit and to route appropriately to the required parties.
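A service-scoped replacement could instead filter on the `job` label and carry a routing label for the owning team. A minimal sketch, assuming a hypothetical `gitaly` job and `team` label (both illustrative, not taken from the runbook):

```yaml
groups:
  - name: gitaly-target-is-down
    rules:
      - alert: GitalyTargetDown
        # Scoped to a single service via the job label (hypothetical job name).
        expr: up{job="gitaly"} == 0
        for: 10m
        labels:
          severity: s3
          team: gitaly  # consumed by Alertmanager routing
        annotations:
          description: "Gitaly target {{ $labels.instance }} is down."
```

Each owning team would maintain its own rule like this, so a stale target is immediately attributable and the rule can be deleted alongside the service.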
Questions:
- Is anyone actively relying on these alerts currently (they are severity S3)?
- As they are S3, would anyone be opposed to re-homing the target checks that are still required into service-specific alert rules, and routing those to a specific channel rather than the generalised alerts channel?
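The channel re-routing mentioned above could be done in Alertmanager by matching on the per-service label. A sketch assuming the hypothetical `team: gitaly` label and illustrative receiver/channel names:

```yaml
route:
  receiver: alerts-general
  routes:
    # Service-scoped target-down alerts go to the owning team's channel.
    - matchers:
        - 'team = "gitaly"'
      receiver: slack-gitaly-alerts

receivers:
  - name: alerts-general
    slack_configs:
      - channel: '#alerts'  # the current generalised channel
  - name: slack-gitaly-alerts
    slack_configs:
      - channel: '#gitaly-alerts'  # hypothetical team channel
```

Anything without a matching `team` label would still fall through to the general receiver, so nothing is silently dropped during the migration.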
