Build a highly available (federated) Prometheus monitoring solution

We need to be smarter about monitoring: if it goes down, we are blind. Unless someone has a better idea, this is how it should look:

Prometheus instances

  • We need at least 2 internal Prometheus server instances.
  • The first instance
    • Should scrape all the endpoints we currently scrape (the exporters).
    • Should have a short data retention period (it does not need to go far back in time).
    • Should scrape the second instance only to check that it is up and scraping its targets.
  • The second instance
    • Should scrape the first instance (and the external instance).
    • Should have a longer data retention period.
    • Should run recording rules for data aggregation and trends.
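
To make the split between the two instances concrete, here is a minimal Python sketch that builds the scrape configuration for each instance as plain dictionaries mirroring the structure of prometheus.yml. It assumes the second instance pulls from the first (and from the external instance) through the /federate endpoint. The host names, job names, and the helper functions are hypothetical, and the actual retention durations still need to be agreed on.

```python
import json

# Hypothetical host:port targets; substitute the real ones.
PRIMARY = "prometheus-1.internal:9090"
SECONDARY = "prometheus-2.internal:9090"
EXTERNAL = "prometheus-external.example.com:9090"


def static_job(job_name, targets, **extra):
    """Build one scrape job with a static target list."""
    job = {"job_name": job_name, "static_configs": [{"targets": list(targets)}]}
    job.update(extra)
    return job


def primary_scrape_configs(exporter_targets):
    """First instance: all exporters, plus the second instance's own metrics."""
    return [
        static_job("exporters", exporter_targets),
        static_job("prometheus-secondary", [SECONDARY]),
    ]


def federation_job(job_name, upstream, match_selectors):
    """Scrape an upstream Prometheus through its /federate endpoint."""
    return static_job(
        job_name,
        [upstream],
        metrics_path="/federate",
        honor_labels=True,  # keep the labels assigned by the upstream instance
        params={"match[]": list(match_selectors)},
    )


def secondary_scrape_configs(match_selectors):
    """Second instance: federate from the first and the external instance."""
    return [
        federation_job("federate-primary", PRIMARY, match_selectors),
        federation_job("federate-external", EXTERNAL, match_selectors),
    ]


def retention_flag(duration):
    """Command-line flag that sets how long an instance keeps its data."""
    return f"--storage.tsdb.retention.time={duration}"


if __name__ == "__main__":
    config = {"scrape_configs": secondary_scrape_configs(['{job="exporters"}'])}
    print(json.dumps(config, indent=2))
```

The first instance would be started with a short duration passed to retention_flag and the second with a longer one. The match[] selectors decide which series the second instance pulls, so they should be kept narrow enough that federation stays cheap.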

Alertmanagers

  • We also need at least 2 Alertmanagers, one running alongside each of these instances.
  • The first instance will alert on whatever is on fire here and now, including the second instance being down.
  • The second instance will only alert, with a critical page, if the first instance is down.
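
The cross-checks described above could be expressed as alerting rules built on the `up` metric that Prometheus records for every scrape target. The sketch below builds the rule groups as Python dictionaries mirroring the rule-file structure. The job names, alert names, runbook URLs, the "warning" severity on the first instance, and the 5m hold time are all assumptions for illustration; only the critical severity for the second instance comes from the plan above.

```python
def instance_down_rule(alert_name, job, severity, runbook_url, for_duration="5m"):
    """Alert when every scrape of the given job has failed for for_duration."""
    return {
        "alert": alert_name,
        "expr": f'up{{job="{job}"}} == 0',
        "for": for_duration,
        "labels": {"severity": severity},
        "annotations": {
            "summary": f"Scrape job {job} has been down for {for_duration}",
            # Hypothetical annotation name; it links the alert to its runbook.
            "runbook_url": runbook_url,
        },
    }


def rule_group(name, rules):
    """Wrap rules in the top-level 'groups' structure of a rule file."""
    return {"groups": [{"name": name, "rules": list(rules)}]}


# Loaded by the first instance: watches the second instance.
PRIMARY_RULES = rule_group(
    "meta-monitoring",
    [
        instance_down_rule(
            "SecondaryPrometheusDown",
            "prometheus-secondary",
            "warning",
            "https://runbooks.example.com/secondary-prometheus-down",
        )
    ],
)

# Loaded by the second instance: pages only when the first instance is down.
SECONDARY_RULES = rule_group(
    "meta-monitoring",
    [
        instance_down_rule(
            "PrimaryPrometheusDown",
            "federate-primary",
            "critical",
            "https://runbooks.example.com/primary-prometheus-down",
        )
    ],
)
```

Each Alertmanager would then route on the severity label, so that on the second instance only the critical alert reaches the pager.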

General guidelines

  • We need to be sure that our monitoring is alive.
  • We need to have clear runbooks for every alert.
  • We need to have clear documentation on how to build alerts for every part of the infrastructure.
  • We need both short-term monitoring and longer-term monitoring with trend calculations.
  • We need to have clear documentation showing how to access this monitoring.
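
For the longer-term trend calculations, one option is to precompute aggregates with recording rules on the instance that keeps data longer. The sketch below builds such a rule group as a Python dictionary mirroring the rule-file structure. The group name, evaluation interval, and rule names are assumptions, and the rules use only the built-in `up` metric so that nothing is assumed about which exporters are deployed.

```python
def recording_rule(record, expr):
    """One recording rule: store the result of expr under a new series name."""
    return {"record": record, "expr": expr}


def recording_group(name, interval, rules):
    """Wrap recording rules in a rule-file group with its own evaluation interval."""
    return {"groups": [{"name": name, "interval": interval, "rules": list(rules)}]}


# Hypothetical trend rules: share of healthy targets and target count per job.
TREND_RULES = recording_group(
    "trends",
    "5m",
    [
        recording_rule("job:up:avg", "avg by (job) (up)"),
        recording_rule("job:up:count", "count by (job) (up)"),
    ],
)
```

Dashboards that look back weeks or months can then query the small precomputed series instead of re-aggregating raw samples.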