Improve Quality and Scaling of Continuous Integration => 65%
Current list of things we have planned as actions
1. Sam - Speaking tour - It is important that all teams in the group are aware of the impact recent bugs have had in production. In the immediate term we need to take special precautions to prevent these bugs from recurring. We also need to address broader quality and risk assessment issues in the short term. Sam is planning to attend all team staff meetings in the group (with some support from folks in other timezones) to discuss the recent incidents and the need for an increased focus on identifying and mitigating the top risks for each team. One goal is to solicit each team's input on the best course of action, while also emphasizing how urgently we need to act and show progress in this area.
   - Agenda item for team staff meetings: https://docs.google.com/document/d/1u6LladnnUfpClcEsAZtV-67Y9k-vYIAtuI4jqhHCgLQ/edit
1. Mek - Working Group focused on “how to train and educate development on using system level testing for better coverage”. Mek is planning to kick off the MR on Feb 24th or 25th.
1. Sam/Kenny - Planned shared KR for all Ops teams to identify the top risks for their group and itemize them on their handbook page. This will help ensure team members are aware of the top risks in their product area, and will also provide an input for auditing integration test coverage in high-risk areas. The current plan is to pilot this approach in teams with recent high-severity incidents (e.g. Runner, Package) and roll it out to the larger group in Q2 if successful. #10862 (closed)
1. Sam - Weekly Ops Quality Triage. Initiate a weekly triage meeting with SETs and the Director + Senior Managers in Ops to review top risks, testing progress, and incidents.
1. Sam - Discuss Stable Counterpart SRE proposal. Some teams have operational concerns but limited SRE knowledge within the team. The Stable Counterpart SRE model is an approach that could help mitigate this.
- (Draft idea) For quality, we can measure the number of incidents and check whether it decreases over March and April while we deploy.
- (Draft idea) For scaling, a statement of the scale we think we can currently support, expressed as number of builds per month or per day.
Retro
Good
- Introduced the practice of risk mapping and risk assessment to Development group teams
- Increased cross-functional engagement on Quality concerns between development, product, and quality team members
- Identified an experiment for introducing SRE stable counterparts into Product Groups.
Bad
- There is significant overlap between Quality and Reliability concerns, which made framing this KR challenging.
- We did not determine a metric-based way to measure progress.
Try
- Broaden definition of Quality to include scaling, reliability, and availability concerns.
- Clarify ownership of the Quality and Reliability centers of excellence (CoEs) within product groups. Who owns these CoEs, and how should product groups coordinate their quality and reliability efforts, which are often intertwined in practice?
Edited by Sam Goldstein