Build a highly available (federated) Prometheus monitoring solution
We need to be smarter about monitoring: if it goes down, we are blind. Unless someone has a better idea, this is how it should look:
Prometheus instances
- We need to have at least 2 instances of internal Prometheus servers.
- The first instance
  - Should scrape all the endpoints we currently scrape (the exporters).
  - Should have a short data retention (not go too far back in time).
  - Should scrape the second instance only to check that it is up and scraping its targets.
- The second instance
  - Should scrape the first instance (and the external instance).
  - Should have a longer data retention.
  - Should run recording rules for data aggregation and trends.
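
As a rough illustration of the layout above, here is a minimal Python sketch that builds the two scrape configurations as plain dictionaries. It uses Prometheus' `/federate` endpoint with `honor_labels` and a `match[]` selector for the second instance, which is one way to do this, not necessarily the one we will settle on. All host names, job names, intervals, retention values, and the rule file name are assumptions made for the example; none of them come from this issue.

```python
import json

# Hypothetical addresses: the issue does not name the servers.
FIRST = "prometheus-1.internal:9090"
SECOND = "prometheus-2.internal:9090"
EXTERNAL = "prometheus-ext.example.com:9090"

# Retention is set with a command-line flag, not in the config file.
# The durations are placeholders, not agreed values.
FIRST_FLAGS = ["--storage.tsdb.retention.time=2d"]
SECOND_FLAGS = ["--storage.tsdb.retention.time=180d"]


def first_instance_config(exporter_targets):
    """Short-retention instance: scrapes every exporter directly."""
    return {
        "global": {"scrape_interval": "15s"},
        "scrape_configs": [
            {
                "job_name": "exporters",
                "static_configs": [{"targets": list(exporter_targets)}],
            },
            {
                # Plain /metrics scrape of the second instance: enough to get
                # `up` and its self-reported scrape metrics, nothing more.
                "job_name": "prometheus-second",
                "static_configs": [{"targets": [SECOND]}],
            },
        ],
    }


def second_instance_config():
    """Long-retention instance: federates from the first and external ones."""
    return {
        "global": {"scrape_interval": "60s"},
        # Hypothetical file that would hold the aggregation/trend rules.
        "rule_files": ["recording_rules.yml"],
        "scrape_configs": [
            {
                "job_name": "federate",
                "honor_labels": True,  # keep the original job/instance labels
                "metrics_path": "/federate",
                # Selector is an assumption: pull every series that has a job.
                "params": {"match[]": ['{job=~".+"}']},
                "static_configs": [{"targets": [FIRST, EXTERNAL]}],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(first_instance_config(["node-1.internal:9100"]), indent=2))
    print(json.dumps(second_instance_config(), indent=2))
```

In practice the `match[]` selector would probably be narrowed to the aggregated series the second instance actually needs, since federating everything can get expensive.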
Alert managers
- We also need at least 2 Alertmanagers, one running alongside each of these Prometheus instances.
- The first instance will alert for whatever is on fire here and now, including the second instance being down.
- The second instance will only alert with a critical page if the first instance is down.
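
The cross-alerting described above could be expressed with one "instance down" rule on each side. The sketch below builds the rule groups as Python dictionaries in the shape Prometheus rule files use (`groups`, `rules`, `alert`, `expr`, `for`, `labels`, `annotations`). The host names, job names, alert name, severities, and the 5 minute wait are all assumptions for illustration.

```python
import json

# Hypothetical addresses and job names: not specified in this issue.
FIRST = "prometheus-1.internal:9090"
SECOND = "prometheus-2.internal:9090"


def instance_down_rule(job, instance, severity, wait="5m"):
    """Alert when the scrape of the given Prometheus instance keeps failing."""
    return {
        "alert": "PrometheusInstanceDown",
        "expr": f'up{{job="{job}", instance="{instance}"}} == 0',
        "for": wait,
        "labels": {"severity": severity},
        "annotations": {"summary": f"Prometheus instance {instance} is down"},
    }


def rule_group(name, rules):
    """Wrap rules in the top-level structure of a Prometheus rule file."""
    return {"groups": [{"name": name, "rules": list(rules)}]}


# Loaded by the first instance, next to its regular infrastructure alerts.
first_rules = rule_group(
    "meta-monitoring",
    [instance_down_rule("prometheus-second", SECOND, "warning")],
)

# Loaded by the second instance: its only alert, routed as a critical page.
second_rules = rule_group(
    "meta-monitoring",
    [instance_down_rule("federate", FIRST, "critical")],
)

if __name__ == "__main__":
    print(json.dumps(first_rules, indent=2))
    print(json.dumps(second_rules, indent=2))
```

Each Alertmanager would then route on the `severity` label, so the second instance's single rule is the one that pages. Note that `up == 0` only fires while the `up` series exists; if the target is removed from the scrape config entirely, the rule stays silent.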
General guidelines
- We need to be sure that our monitoring is alive.
- We need to have clear runbooks for every alert.
- We need to have clear documentation on how to build alerts for every part of the infrastructure.
- We need to have both short-term monitoring and longer-term monitoring with trend calculations.
- We need to have clear documentation showing how to access this monitoring.