Update FCL to state EMs can determine who must take part

David O'Regan requested to merge fcl-update into master

Purpose

This proposal aims to improve the Efficiency and Results of teams directly involved in FCLs by offering a more targeted, tactical process in place of a sledgehammer approach.

Pre FCL

Prior to FCLs, when an S1 or P1 was introduced there was no negative impact on the teams involved. This was in line with our blameless value, and incidents were handled in a mostly ad hoc fashion, either by the team themselves or by whoever managed to discover the solution first. While not perfect, this approach had some benefits: there was consensus that no one team needed to be the DRI for an S1/P1, and everyone was encouraged to try to solve the issue if possible.

Post FCL

We now opt to place the entire team(s) on reliability issues regardless of what the team's working backlog looks like or whether doing so is even feasible. This runs counter to our values of Results and Efficiency, as it can throw the team(s) into chaos with little direction on how they can actually help.

Example

  • Team A: Introduces S1
  • Team B: Merges Team A's code
  • Team A: 3 FE Engineers, 2 BE Engineers
  • Team A: 5-day FCL kicks off with only 2 of the team members able to genuinely tackle reliability issues from the BE side of things, leaving the 3 FE Engineers scrambling to discover how they can help with reliability.

Proposed Solution

We allow Engineering Managers (EMs) to work with their team to determine who can make the most effective impact over the 5-day FCL, while allowing the rest of the team to continue current in-flight work.

Benefits

  1. This helps to promote our values of Results and Efficiency as it allows each team to determine the most effective way to tackle the S1/P1 and devise a strategy to prevent similar issues in the future.
  2. It allows the EM of the team to holistically determine how to find preventive measures moving forward for similar S1/P1s.
  3. With a smaller number of people involved, the issue can be iterated on much more quickly and there will be much less lost time due to a lack of context from certain members of the team.
  4. Team members who cannot meaningfully contribute to reliability do not lose 5 days of development time (even more in terms of cognitive load).
  5. The rest of the team can continue to function as a holistic unit in a meaningful way in parallel with the reliability-focused team members.
  6. Helps to encourage a blameless culture by not punishing a team and instead identifying the right people to fix the issue at hand.
  7. Allows each EM to find the correct strategy for their individual team to react to FCLs. As each team is distinctly different, each EM should be able to find an efficient process for working with the right team members to iterate quickly on S1/P1 issues and set up preventive measures.

Potential Challenges

  1. Each team could have a different response process to an FCL.
  2. BE-heavy teams are more likely to see disparity in who would need to tackle these FCLs.
  3. We might miss some small reliability wins by not having all team members join the 5-day burn-down.
  4. Teams with Staff+ engineers are more likely to see those engineers' time used more and more for handling FCLs, and this might create knowledge gaps for handling incidents.
  5. EMs may not honor the process of setting their teams up for success with preventative measures after experiencing an FCL.
    • This can be solved by empowering EMs to take 1 day of the FCL to spend time devising preventative measures and then communicating them to the team in an async retro.
  6. EMs might be at capacity and unable to effectively coordinate who should be involved with the FCL in a timely manner.

Author Checklist

  • Provided a concise title for the MR
  • Added a description to this MR explaining the reasons for the proposed change, per say-why-not-just-what
    • Copy/paste the Slack conversation to document it for later, or upload screenshots. Verify that no confidential data is added.
  • Assign reviewers for this change to the correct DRI(s)
    • If the DRI for the page(s) being updated isn’t immediately clear, then assign it to one of the people listed in the "Maintained by" section on the page being edited.
    • If your manager does not have merge rights, please ask someone to merge it AFTER it has been approved by your manager in #mr-buddies.
  • [-] If the changes affect team members, or warrant an announcement in another way, please consider posting an update in #whats-happening-at-gitlab linking to this MR.
    • If this is a change that directly impacts the majority of global team members, it should be a candidate for #company-fyi. Please work with internal communications and check the handbook for examples.

Edited by David O'Regan
