Create alerting for the database instance before its disk space fills up completely

Summary

During 2023-11-22: High database load caused slow and ... (production#17168 - closed), there was a period of time when no db metrics were available. The primary instance ran out of disk space because of the large number of queries being run, processed, and subsequently logged.

Note: the EOC was NOT paged when disk space reached 100%, and the alert was only seen in #production. Both paging and EARLIER alerting (e.g. on hitting an 80-90% threshold) would be expected.

Related Incident(s)

2023-11-22: High database load caused slow and ... (production#17168 - closed)

Desired Outcome/Acceptance Criteria

Have monitoring/alerting in place so that we are notified before disk space usage reaches 100%, in particular when the database is executing numerous queries and its logs are filling up the disk.
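
As a rough illustration of the tiered behaviour described above, here is a minimal Python sketch. It is not the actual alerting rule: the 80% and 90% values are assumptions taken from the range mentioned in the Summary, and the names `WARN_PERCENT`, `PAGE_PERCENT`, `classify_disk_usage` and `check_path` are hypothetical. In practice this logic would live in the monitoring stack rather than in a script.

```python
import shutil

# Hypothetical thresholds, based on the 80-90% range suggested in this issue.
WARN_PERCENT = 80.0  # early warning, e.g. a message in the team channel
PAGE_PERCENT = 90.0  # page the on-call engineer before the disk is full


def classify_disk_usage(used_bytes: int, total_bytes: int) -> str:
    """Return "ok", "warn" or "page" for the given disk usage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_percent = 100.0 * used_bytes / total_bytes
    if used_percent >= PAGE_PERCENT:
        return "page"
    if used_percent >= WARN_PERCENT:
        return "warn"
    return "ok"


def check_path(path: str) -> str:
    """Classify the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify_disk_usage(usage.used, usage.total)


if __name__ == "__main__":
    print(check_path("."))
```

The point of the two tiers is that the warning leaves time to act (for example, by reducing log volume) and the page fires while the instance can still write metrics and logs.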

Associated Services

Corrective Action Issue Checklist

Link the incident(s) this corrective action arose from
Give context for what problem this corrective action is trying to prevent re-occurring
Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
Assign a service label

Edited Nov 23, 2023 by Cheryl Li