Improve Consul health checks for database nodes so that unhealthy replicas are dropped from the pool

Summary

We should improve the Consul health checks for the database nodes to reduce the impact of scenarios like production#6253 (closed), where replicas became unhealthy and we still attempted to use them. This behavior resulted in a full site outage. Even one or two unhealthy databases with slow query times can cause worker saturation on the front-end, which is what we saw in the referenced incident.

As seen in this comment, Consul health checks continued to pass well into the event and didn't fail until we forced the replicas into maintenance.
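As a starting point for discussion, here is a minimal sketch of a check that fails on slow queries rather than only on connectivity. It assumes the check runs as a Consul script check, where exit code 0 is passing, 1 is warning, and anything else is critical. The `probe` callable, the function names, and the thresholds are all hypothetical; this issue does not specify what the current check measures or what the right limits would be.

```python
import time
from typing import Callable

# Exit codes understood by Consul script checks.
PASSING, WARNING, CRITICAL = 0, 1, 2


def classify_latency(seconds: float,
                     warn_after: float = 0.5,
                     fail_after: float = 2.0) -> int:
    """Map a probe duration to a Consul check status (thresholds are placeholders)."""
    if seconds >= fail_after:
        return CRITICAL
    if seconds >= warn_after:
        return WARNING
    return PASSING


def run_check(probe: Callable[[], None],
              warn_after: float = 0.5,
              fail_after: float = 2.0) -> int:
    """Time a hypothetical probe query and return the matching exit code.

    Any exception from the probe (connection refused, timeout, ...) is
    treated as critical.
    """
    start = time.monotonic()
    try:
        probe()
    except Exception:
        return CRITICAL
    return classify_latency(time.monotonic() - start, warn_after, fail_after)
```

In a real check script, `probe` would run a cheap query against the local replica and the script would end with `sys.exit(run_check(probe))`.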

There is a risk that enhancing this check could unintentionally cause the database to drop out of the pool completely, so we would need a safeguard against that.
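One possible shape for that safeguard is a floor on pool size: drop unhealthy replicas only while enough healthy ones remain, and otherwise fail open. The sketch below is illustrative only. The function name, the `min_pool_size` parameter, and the choice to fall back to every known replica are assumptions, and this issue does not say whether such logic would live in the check itself or in whatever consumes the Consul results.

```python
from typing import Dict, List


def usable_replicas(health: Dict[str, bool], min_pool_size: int = 1) -> List[str]:
    """Return the replicas that should receive traffic.

    `health` maps a replica name to whether its health check is passing.
    Unhealthy replicas are dropped unless doing so would leave fewer than
    `min_pool_size` replicas, in which case all known replicas are kept
    (fail open) so a too-strict check cannot empty the pool.
    """
    healthy = sorted(name for name, ok in health.items() if ok)
    if len(healthy) >= min_pool_size:
        return healthy
    # Too few healthy replicas: keep everything rather than drop out completely.
    return sorted(health)
```

For example, with `{"replica-01": True, "replica-02": False}` only `replica-01` is returned, while a map in which every replica is failing returns all of them unchanged.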

Related Incident(s)

Originating issue(s): production#6253 (closed)

Desired Outcome/Acceptance criteria

Associated Services

Corrective Action Issue Checklist

  • link the incident(s) this corrective action arose out of
  • give context for what problem this corrective action is trying to prevent from re-occurring
  • assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • assign a priority (this will default to 'priority::4')
Edited by John Jarvis