Investigate alerting in the case where postgres_exporter cannot make database connections
Recently we had an outage caused by bad SQL that overloaded the database. Excessively long-running queries used up every database connection, which prevented postgres_exporter from connecting:
- The database alerts that fired were initially confusing. They reported that the number of Postgres databases had changed, when in fact the overloaded databases had simply stopped reporting metrics to Prometheus.
- We can investigate how to make postgres_exporter more reliable on an overloaded database. It might need a persistent database connection, or some way to reserve database connections for it.
- We should look into why the alerts for missing Prometheus scrapes didn't fire.
- We should consider paging on "number of databases changed". We can build silences into the manual failover process, and HA failovers are uncommon enough not to create too much noise.
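
To make the alert cases above easier to reason about, here is a minimal Python sketch of the states we may want to tell apart. It assumes two signals per database host: Prometheus's own `up` series for the exporter target, and the `pg_up` gauge that postgres_exporter exposes to say whether it could connect to Postgres. The `ExporterSample` type, the `classify` function, the state names, and the host names are all hypothetical and for illustration only. Whether the scrape itself kept succeeding during the outage is not established here; that is part of what needs investigating.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ExporterSample:
    """One observation for a database host (hypothetical shape)."""
    instance: str
    scrape_up: bool  # Prometheus `up`: did the scrape of the exporter succeed?
    pg_up: bool      # postgres_exporter `pg_up`: could the exporter reach Postgres?


def classify(expected: Iterable[str], samples: List[ExporterSample]) -> Dict[str, str]:
    """Map every expected host to one of: ok, db_unreachable, scrape_failed, no_data."""
    by_instance = {s.instance: s for s in samples}
    states: Dict[str, str] = {}
    for name in expected:
        sample = by_instance.get(name)
        if sample is None:
            # The host vanished from the metrics entirely. A bare count of
            # databases would report this as "number of databases changed".
            states[name] = "no_data"
        elif not sample.scrape_up:
            # The exporter process itself could not be scraped.
            states[name] = "scrape_failed"
        elif not sample.pg_up:
            # The exporter answered, but could not get a database connection.
            states[name] = "db_unreachable"
        else:
            states[name] = "ok"
    return states


if __name__ == "__main__":
    observed = [
        ExporterSample("db1", scrape_up=True, pg_up=True),
        ExporterSample("db2", scrape_up=True, pg_up=False),
    ]
    for host, state in classify(["db1", "db2", "db3"], observed).items():
        print(host, state)
```

Comparing against an explicit list of expected hosts, rather than a count, is what lets "stopped reporting" be named directly instead of surfacing as a changed total.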