Build a highly available (federated) Prometheus monitoring solution
We need to be smarter about monitoring: if it goes down, we are blind. Unless someone has a better idea, this is how it should look:
Prometheus instances
- We need at least 2 instances of internal Prometheus servers.
- The first instance
  - Should scrape all the endpoints we currently scrape (the exporters).
  - Should have a short data retention (it does not need to reach far back in time).
  - Should scrape the second instance only to check that it is up and scraping its targets.
- The second instance
  - Should scrape the first instance (and the external instance).
  - Should have a longer data retention.
  - Should run recording rules for data aggregation and trends.
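As a starting point for discussion, the split above could be sketched roughly as follows. This is only a sketch: every hostname, job name, interval, and retention value below is a placeholder assumption, not something we have agreed on. Note that retention is not set in the config file; it is set with the `--storage.tsdb.retention.time` flag when starting each server.

```yaml
# prometheus.yml for the FIRST instance (short retention).
# Assumed start flag: --storage.tsdb.retention.time=15d  (value is a placeholder)
global:
  scrape_interval: 15s          # placeholder interval

scrape_configs:
  # All the exporters we scrape today (target is a hypothetical example).
  - job_name: exporters
    static_configs:
      - targets: ['node-exporter-1.example.internal:9100']

  # The second instance, scraped only for its own /metrics so we can
  # see that it is up and ingesting samples.
  - job_name: prometheus-secondary
    static_configs:
      - targets: ['prometheus-2.example.internal:9090']
```

```yaml
# prometheus.yml for the SECOND instance (long retention).
# Assumed start flag: --storage.tsdb.retention.time=1y  (value is a placeholder)
global:
  scrape_interval: 60s          # placeholder; federation can use a coarser interval

scrape_configs:
  # Pull series from the first (and the external) instance via /federate.
  - job_name: federate
    honor_labels: true          # keep the original job/instance labels
    metrics_path: /federate
    params:
      'match[]':
        - '{job="exporters"}'   # assumed selector; narrow to what we need long term
    static_configs:
      - targets:
          - 'prometheus-1.example.internal:9090'     # hypothetical first instance
          - 'prometheus-external.example.com:9090'   # hypothetical external instance

rule_files:
  - rules/aggregation.yml       # hypothetical path
```

```yaml
# rules/aggregation.yml on the second instance: one example recording rule.
# The metric and the rule name are illustrative only.
groups:
  - name: aggregation
    interval: 5m
    rules:
      - record: instance:node_cpu_seconds:rate5m
        expr: sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```

Keeping the `match[]` selector narrow matters here: federating every raw series would make the second instance a full copy of the first, which is not the goal for long-term trends.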
Alertmanagers
- We also need at least 2 Alertmanagers, one running alongside each of these instances.
- The first instance will alert for whatever is on fire here and now, including the second instance being down.
- The second instance will only alert with a critical page if the first instance is down.
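The cross-checking alerts could look something like the rule files below. Again, a sketch under assumptions: it assumes the first instance scrapes the second under a job named `prometheus-secondary`, that the second instance scrapes the first under a job named `federate`, and that the hostnames, `for` durations, and severity labels are placeholders we would still need to decide on.

```yaml
# Rule file on the FIRST instance: watch the second instance.
groups:
  - name: meta-monitoring
    rules:
      - alert: SecondaryPrometheusDown
        expr: up{job="prometheus-secondary"} == 0
        for: 5m                          # placeholder duration
        labels:
          severity: warning              # placeholder severity
        annotations:
          summary: "Second Prometheus instance is not reachable"

      # Up but not ingesting anything (assumed to be a good enough
      # signal for "it is not scraping targets").
      - alert: SecondaryPrometheusNotIngesting
        expr: rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-secondary"}[5m]) == 0
        for: 10m                         # placeholder duration
        labels:
          severity: warning
        annotations:
          summary: "Second Prometheus instance is ingesting no samples"
```

```yaml
# Rule file on the SECOND instance: the only paging alert it owns.
groups:
  - name: meta-monitoring
    rules:
      - alert: PrimaryPrometheusDown
        # `up` here is generated by the second instance for its own
        # federation scrape of the first instance (hypothetical hostname).
        expr: up{job="federate", instance="prometheus-1.example.internal:9090"} == 0
        for: 5m                          # placeholder duration
        labels:
          severity: critical             # should route to a page
        annotations:
          summary: "First Prometheus instance is down, we are blind"
```

Each Prometheus would then point at its own Alertmanager through the `alerting.alertmanagers` section of its config, so that losing one pair does not silence the other.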
General guidelines
- We need to be sure that our monitoring is alive.
- We need to have clear runbooks for every alert.
- We need to have clear documentation on how to build alerts for every part of the infrastructure.
- We need to have both short-term monitoring and longer-term monitoring with trend calculations.
- We need to have clear documentation showing how to access this monitoring.