Create alerting for the database instance before its disk space fills up completely
Summary
During 2023-11-22: High database load caused slow and ... (production#17168 - closed), there was a period when no db metrics were available: the primary instance ran out of disk space because of the large number of queries being run, processed, and subsequently logged.
Note: The EOC was NOT paged when disk space reached 100%, and the alert was only seen in #production. Both paging and EARLIER alerting (e.g. on hitting an 80-90% threshold) would be expected.
Related Incident(s)
2023-11-22: High database load caused slow and ... (production#17168 - closed)
Desired Outcome/Acceptance Criteria
Have monitoring/alerting in place so that we are notified before disk usage reaches 100%, in particular when the database is executing numerous queries and its logs are filling up the disk.
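As a rough illustration of the tiered behaviour described above (an early warning first, then a page), here is a minimal Python sketch. The 80% and 90% thresholds are assumptions taken from the range mentioned in the note, and the names `classify_disk_usage` and `check_path` are hypothetical. The real alert would presumably live in the existing monitoring stack rather than in a script like this.

```python
import shutil

# Assumed thresholds, based on the 80-90% range mentioned in the note above.
WARN_THRESHOLD = 0.80  # early alert, e.g. posted to a channel
PAGE_THRESHOLD = 0.90  # page the on-call engineer


def classify_disk_usage(used_fraction, warn=WARN_THRESHOLD, page=PAGE_THRESHOLD):
    """Map a used-space fraction (0.0 to 1.0) to an alert level."""
    if used_fraction >= page:
        return "page"
    if used_fraction >= warn:
        return "warn"
    return "ok"


def check_path(path="."):
    """Classify the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify_disk_usage(usage.used / usage.total)


if __name__ == "__main__":
    print(check_path("."))
```

Keeping the two thresholds separate means the warning can fire while there is still time to act, with paging reserved for the point where intervention is urgent.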
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose from
- Give context for what problem this corrective action is trying to prevent re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
- Assign a service label
Edited by Cheryl Li