Transition Reliability Standing Squads into Permanent Teams
Overview
The Reliability team is currently 33 engineers and 5 engineering managers in a matrix organisation. Engineers are aligned temporarily to squads, while line management is assigned by geographical proximity, pairing each engineer with the engineering manager in the best-suited time zone. This has served the team well in achieving important goals in FY23 but is now a limiting factor, as discussed in #123.
We've recently iterated to introduce standing squads, recognising the requirement for long-lived teams. This issue takes that to the next level by introducing permanent teams with long-term membership and global line management.
Principles
- Initially the teams won’t change names or core functions but we will drop the term Squads because it has been used for temporary teams. We’ll need to better define their visions, missions and boundaries. This will maintain existing stability and we'll iterate on this as we scale.
- Each team will have an Engineering Manager that everyone on the team directly reports to. This removes the matrix management we created and simplifies the relationships for both the manager and individual.
- Teams will be made up of geographically diverse team members and will operate as a global team. This allows us to have good coverage in knowledge throughout time zones and encourages us to work more asynchronously. There may be a need to be flexible as a team for sync activities but these should be reserved for when sync is truly required.
- Teams will be made up of all the roles required to achieve their missions. This means SREs, DBREs and Backend Engineers may all be on the same team.
- Each team will have a specific set of services or activities that they are responsible for, to narrow down their focus outside of on-call. Each team will have a long-term mission with associated KRs and PIs that contribute to the Infrastructure PIs.
- On-Call will remain as is and be shared amongst all the teams and team members. This is not ideal and we’ll iterate on this but not now.
- Team-level make-up will vary by mission and available engineers but should aim to be 1x Manager, 1x Staff, 2-3x Senior and 3x Intermediate. We should generally stay inside the gearing ratios and team sizes (point 5) already implemented.
- Team capacity and membership are not permanent but should not fluctuate dramatically (so as not to keep the team stuck in the forming phase of the stages of group development), and will be reviewed each quarter. There should ALWAYS be enough capacity to cover keep-the-lights-on and on-call activities. Team members should be encouraged to try out other areas of the infrastructure structure but should not be forced to if they enjoy their specialization and it meets their career goals. Line manager stability is an important concern when making membership changes.
- Vacancies will be assigned directly to each team and they should seek to hire individuals that meet the needs of that team at the time whilst recognising that the individual may want/need to move to another team based on capacity or career goals.
- Keep-the-lights-on activities should be capacity planned each quarter, and efforts should be made each quarter to reduce the capacity they require so that the teams can focus on more progressive work.
- Initially the teams will have LOTS of technical debt. This should be addressed as early as possible alongside other priorities to set the team up for long-term success.
Next Steps
- Gather feedback by 2022-12-28: DRI @alanrichards
- Update reliability page to remove standing squads and describe a standard team structure: DRI @alanrichards - gitlab-com/www-gitlab-com!117412 (merged)
- Update team definitions in handbook to include new mission, boundaries, KRs & PIs, etc. and communicate to teams outside of Reliability: DRI: Respective Managers #163 (closed) #165 (closed) #166 (closed) #167 (closed) #168 (closed)
- Establish dedicated Slack channels for teams, deprecate old squad-based Slack channels
- Make initial team assignments based on individual preference, geographic diversity and team capacity demand. #169
- Create team formation template and assign to each team. #176 (closed) #177 #178 (closed) #179 (closed) #180 (closed)
- Change line managers for individuals and open roles. #169 & gitlab-com/www-gitlab-com!118921 (merged)
- Update hiring process for roles allocated to each team
- Create team "509" epics (see also the transition items on the vancouver workflow doc). https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17173
- Write handbook page for quarterly review of team capacity, taking into account team member career mobility
- Assign items that are falling between gaps into teams or keep with the General Teams.
- Continue to iterate on this as the FY24 roles are opened.
Input, feedback, debate all welcome
Edited by Alan Richards