Investigate alerting in cases where postgres_exporter cannot make database connections

We recently had an outage caused by bad SQL that overloaded the database. Excessively long-running queries held every available database connection, which prevented postgres_exporter from connecting:

  • The database alerts that fired were initially confusing. They reported that the number of Postgres databases had changed, because the overloaded databases had simply stopped reporting metrics to Prometheus.
    • We can investigate how to make postgres_exporter more reliable against an overloaded database. It might need a persistent database connection, or some way to reserve database connections for it.
    • We should look into why the alerts for missing Prometheus scrapes didn't fire.
    • We should consider paging on "number of databases changed". We can build silences into the manual failover process, and HA failovers are uncommon enough not to create too much noise.
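
To make the confusion above concrete, here is a minimal Python sketch of the distinction an alert could draw between "Prometheus could not scrape the exporter" and "the exporter answered but could not reach Postgres". It is only an illustration under assumptions, not our current alerting: the function names and the instance names `db-01` to `db-03` are hypothetical, and it assumes each snapshot carries Prometheus's `up` series plus the exporter's `pg_up` gauge for every instance. In practice the same logic would probably live in Prometheus alert rules rather than in Python.

```python
from typing import Dict, Iterable


def classify_targets(
    expected: Iterable[str],
    up: Dict[str, float],
    pg_up: Dict[str, float],
) -> Dict[str, str]:
    """Classify each expected database instance from one scrape snapshot.

    `expected` is a hypothetical static inventory of instances, so an
    instance that silently disappears is still reported instead of just
    lowering a count.
    """
    statuses: Dict[str, str] = {}
    for instance in expected:
        if up.get(instance, 0.0) != 1.0:
            # Series missing or 0: Prometheus could not scrape the exporter.
            statuses[instance] = "scrape_failed"
        elif pg_up.get(instance, 0.0) != 1.0:
            # Exporter answered, but it could not connect to Postgres.
            statuses[instance] = "db_unreachable"
        else:
            statuses[instance] = "ok"
    return statuses


def should_page(statuses: Dict[str, str]) -> bool:
    """Page if any expected instance is not fully healthy."""
    return any(status != "ok" for status in statuses.values())


if __name__ == "__main__":
    snapshot = classify_targets(
        expected=["db-01", "db-02", "db-03"],
        up={"db-01": 1.0, "db-02": 1.0},      # db-03 stopped reporting
        pg_up={"db-01": 1.0, "db-02": 0.0},   # db-02 exporter cannot connect
    )
    print(snapshot, should_page(snapshot))
```

Comparing against a fixed inventory, rather than counting whichever series happen to exist, is the design choice that would have named the overloaded databases directly instead of reporting a changed count.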