[Paused] Discussion: Tiered Support Model for Runway
Issue currently paused, to be picked up at a later date
TL;DR
Services are hard and culture is harder. We should shape the next 3+ years of support expectations and align on a model that makes sense for both self-serve and non-self-serve scenarios.
Context
Referring back to the original blueprint that inspired Runway, a core requirement (R110) was stated as:
With the exception of an Infrastructure-led onboarding process, services are owned, deployed and managed by stage-group teams. In other words, services follow a “You Build It, You Run It” model of ownership.
This is far from what we have now and at this point in time may be impractical for several reasons:
- Stage teams do not have SRE capability
- Investment in capability in all Dev teams is not desirable from a business PoV
- It's extra workload that stage teams do not have to consider today
However, with the introduction of solutions like Runway, we can improve the tools and processes so that deep SRE expertise is not required for most support scenarios, and we can codify common practices and sensible defaults. This is unlikely to replace the need for centralised, expert SREs in the long term. We should plan both for the transitional phase, where new "You Build It, You Run It" expectations are set with Development teams, and for the long term, where expert SREs may be required alongside the Runway team on a tiered support basis.
Challenges
- GitLab does not currently have a concept of tiered support internally (L1, L2, L3, etc.)
- Capacity constraints are everywhere, particularly in Reliability and Scalability Groups
- Stage teams have a cultural expectation around "That's an infrastructure problem" / "Throw it over the fence"
- Borders and responsibilities between Dev/Infra are blurred at best and unclear at worst
Goals
- Start the discussion about what the next generation of SRE support should look like (Runway in this context, but it does not need to be limited to Runway: Cells, Satellite services, etc.)
- Put together an infrastructure PoV
- Define Responsibilities and Team Boundaries, e.g.
  - Who creates alerts for my feature/product?
  - Me on call? You on call?
  - Scaling, rate limits and acceptable user experience/Apdex?
  - I have a GCP problem
  - Quota increases
  - Edge Cases
- Make proposal on Tiered Support / Future of Support
- Engage Development on proposal
Open Questions
- Is a tiered support model the right thing?
- How much of a barrier are current cultural expectations?
- If any extra funding is required, how much?