Improve SET On-call Process
Summary
This issue tracks the effort to improve the current Test Platform On-call DRI process.
Problem Statement
- SETs spend more than two weeks each quarter on on-call duties, which accounts for 15-20% of their work allocation and significantly reduces the team's overall productivity.
- Issue triage is inconsistent. A quick search shows that we currently have 182 open issues that are not triaged (they have the failurenew label), and many of those are over a year old.
- There is an absence of a robust dashboard or metrics system, hindering our ability to make informed, data-driven decisions regarding on-call duties and issue management.
- Product teams are not sufficiently integrated into the on-call process, which centralizes the responsibility to a few SETs and misses opportunities for wider team engagement and ownership.
- Further points are being discussed in #2543 (comment 1827631679)
Goals
- Refine the on-call support process to alleviate the time commitment required from SETs, redistributing this time towards proactive work.
- Implement a standardized triage protocol to ensure timely and regular review of open issues.
- Develop a comprehensive dashboard or set of metrics to provide real-time visibility into the on-call process and issue status.
- Empower product teams to assume on-call duties, fostering a sense of ownership and accountability for product-related issues.
Success Metrics
- A reduction in the SET on-call time allocation by at least 50%
- A sustained decrease in open, untriaged issues by at least 80%
- Establishment of a dashboard that tracks on-call activities, response times, and issue resolution metrics
- Transition of at least 30% of on-call duties to product teams, measured by the share of incidents handled directly by product teams instead of SETs.
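
To make these targets concrete, here is a rough sketch that turns the baseline figures from the Problem Statement (182 untriaged issues, 15-20% on-call allocation) into numeric thresholds. The function and variable names are hypothetical, and the incident counts in the usage note are illustrative only, not real data.

```python
import math


def reduced_target(baseline: float, reduction: float) -> float:
    """Largest value that still satisfies a fractional reduction from baseline."""
    return baseline * (1.0 - reduction)


def product_team_share(handled_by_product: int, total_incidents: int) -> float:
    """Fraction of incidents handled directly by product teams (hypothetical metric)."""
    if total_incidents == 0:
        return 0.0
    return handled_by_product / total_incidents


# Baselines quoted in the Problem Statement.
untriaged_baseline = 182
oncall_share_range = (0.15, 0.20)

# An 80% reduction in untriaged issues; round down so the target is actually met.
untriaged_target = math.floor(reduced_target(untriaged_baseline, 0.80))

# A 50% reduction in SET on-call allocation, applied to both ends of the range.
oncall_target_range = tuple(reduced_target(s, 0.50) for s in oncall_share_range)

print(f"Untriaged issues target: at most {untriaged_target}")
print(
    "On-call allocation target: at most "
    f"{oncall_target_range[0]:.1%} to {oncall_target_range[1]:.1%}"
)
```

Under these assumptions, the untriaged backlog would need to stay at or below 36 issues, and on-call allocation at or below roughly 7.5-10%. For the last metric, a call such as `product_team_share(3, 10)` returns `0.3`, which would just meet the 30% threshold.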
cc: @gl-quality
Edited by Abhinaba Ghosh