Add SRE roles section to Reliability handbook page
Why is this change being made?
While the Reliability Team has a defined process for working on projects and incidents, there is another category of work that is not accounted for. Smaller requests come in all the time and, currently, the only way to get this work done is to pull people off a project or an on-call shift. The context switching that this pattern requires is significant. It increases the total processing time of issues and, more importantly, the stress levels of SREs.
Additionally, it has become very common for engineers to face multiple incidents at the same time while on call. This, too, has a very negative impact on the stress levels of SREs, and fielding multiple incidents on one shift can make it difficult for an EOC to work on anything else. The net result is more stress for the SREs and a longer processing time for non-urgent issues.
To help with these problems, we'd like to introduce the concept of a Backup Engineer On Call (BEOC) and further clarify some of the existing roles within the Reliability team. We are aiming to achieve the following benefits as a result:
- A designated supporting resource for EOCs during on-call shifts
- A process for redirecting miscellaneous requests
- Clarity of tasks and assignments
- Reduced context switching
- Ability to better handle multiple incidents or larger incidents
- Progress on multiple categories of work:
  - Smaller issues/non-projects
  - Corrective Actions
  - Staging
- Decreased cycle time for small requests that enable other teams