Remove catch-all rule for Prometheus target down

Currently we have a very noisy check in place for verifying that Prometheus targets are up.
It acts as a catch-all, as there is no specific filtering on it:

https://gitlab.com/gitlab-com/runbooks/-/blob/master/rules/target-is-down.yml.
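For context, a catch-all target-down rule typically looks something like the following. This is a sketch of the general pattern only; the rule names, durations, and labels here are assumptions, and the actual contents of the linked file may differ:

```yaml
groups:
  - name: target-is-down
    rules:
      - alert: TargetIsDown
        # Fires for ANY scrape target reporting down -- no job or
        # service filtering, so every stale static target triggers it.
        expr: up == 0
        for: 10m
        labels:
          severity: s3
        annotations:
          summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"
```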

If we look over a week, many of the alerts triggered by this rule are consistently under the threshold, or have flatlined completely.

[screenshot: one week of alerts triggered by this rule, with source dashboard link]

This has resulted in a lot of excess alert noise that is just being ignored.

Generally speaking, an alert like this would be fine where service discovery is in use - e.g. Kubernetes SD - since in that situation the targets are dynamically updated to reflect the real state of discovered endpoints.
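To illustrate why service discovery avoids the stale-target problem, here is a minimal Kubernetes SD scrape config sketch (job name and annotation convention are illustrative; the standard `prometheus.io/scrape` annotation pattern is assumed). Pods that disappear are automatically dropped from the target list, so a down-target alert reflects real endpoints:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      # Discover pods directly from the Kubernetes API; the target
      # list updates as pods come and go, with no manual cleanup.
      - role: pod
    relabel_configs:
      # Only scrape pods that opt in via annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```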

What remains, and can often lead to unaudited noise, is static targets that have not been cleaned up.
Ideally we should remove such a catch-all rule, and where target-up checking is important it should be scoped specifically to the service.
This ensures the alerting rule is tied to a service, and is much easier to audit and route appropriately to the required parties.
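A service-scoped replacement might look like the sketch below. The job name, team label, and routing convention are hypothetical examples, not existing rules; the point is that the `up` expression is filtered to one job, and a label identifies the owning team so Alertmanager can route it to a dedicated channel:

```yaml
groups:
  - name: example-service-availability
    rules:
      - alert: ExampleServiceTargetDown
        # Scoped to a single job, so ownership is unambiguous and the
        # rule can be audited alongside the service it protects.
        expr: up{job="example-service"} == 0
        for: 10m
        labels:
          severity: s3
          team: example-team   # hypothetical label used for Alertmanager routing
        annotations:
          summary: "example-service target {{ $labels.instance }} is down"
```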

Questions:

  • Is anyone actively relying on these alerts currently (they are S3)?
  • As they are S3, would anyone be opposed to re-routing the target checks that are still required into service-specific alert rules, and routing those to a specific channel rather than the generalised alerts channel?
Edited by Nick Duff