Corrective action: Improve time to root cause for PatroniServiceRailsPrimarySqlApdexSLOViolation
Summary
In production#8311 (closed), it took about 3 hours to find the root cause.
- How can we reduce time to root cause?
- See gitlab-org/gitlab!110447 (comment 1256176703) for sets of graphs
Related Incident(s)
Originating issue(s): production#8311 (closed)
Desired Outcome/Acceptance Criteria
We mitigated the incident rather early by rolling back the deployment, which was lucky.
We then spent a lot of time figuring out which diff caused the issue. I remember we were stumped that both the total time spent on SQL (pg_stat_statements_seconds_total) and pg_stat_statements_calls were low.
Are we missing some metric/dashboard, or a runbook entry?
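One possible angle on the question above, offered as a sketch and not as a confirmed diagnosis of production#8311: when both counters are low in aggregate, a small number of slow calls can still hurt apdex, and a per-call view (change in seconds divided by change in calls) might surface that. The snippet below only illustrates the arithmetic on two counter samples. The `CounterSample` type and `mean_seconds_per_call` function are hypothetical names, and whether such a derived metric would have shortened this investigation is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterSample:
    """One scrape of the two cumulative counters for a single query (hypothetical shape)."""
    seconds_total: float  # value of pg_stat_statements_seconds_total
    calls: float          # value of pg_stat_statements_calls


def mean_seconds_per_call(before: CounterSample, after: CounterSample) -> Optional[float]:
    """Return the mean seconds per call between two scrapes.

    Returns None when there were no new calls or when a counter went
    backwards (for example after a statistics reset).
    """
    delta_calls = after.calls - before.calls
    delta_seconds = after.seconds_total - before.seconds_total
    if delta_calls <= 0 or delta_seconds < 0:
        return None
    return delta_seconds / delta_calls
```

A query with few calls and a modest total can still show a high per-call figure, which is the kind of signal a dashboard panel or runbook step could look for.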
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4')
Edited by Thong Kuah