Initial Proposal for a Service onboarding process with early engagement
Context
The GitLab product is growing and is naturally introducing new services/components. In the way GitLab currently delivers software, new services/components are started by development teams, and only after some development phases, or during the final ones, is the Infrastructure Group involved in evaluating the requirements needed to onboard a service into the production environment and fulfill all the necessary operational requirements. The involvement of the Infrastructure Group in onboarding a new service into production is required, but when it happens is at the discretion of the development team. Often, Infrastructure is involved only at later stages of the development lifecycle, usually close to the launch, when an SRE is brought in to assess production readiness. One reason for this late involvement may be that we currently lack a single source of truth with detailed steps on how and when to approach Infrastructure, where to find available documentation, and where to route the various requests.
GitLab's requirement for a service to be onboarded is to complete a Production Readiness Review (PRR). The development team often considers and carries out this activity in the last stages before launch.
With this approach, these operational readiness requirements are coming in at the final stages of development. The upcoming release date can contribute to rushing the operational requirements, overlooking them, or considering them second-class citizens compared to the service/component feature development, resulting in increased launch risk that could potentially impact the entire product and the Service Level Objectives communicated to our customers.
While the goal of a new service/component is to provide added value to the end user, this value is not achieved if the service/component runs unreliably and does not provide the right level of information to whoever has to operate it or to the development teams that authored it.
As in the SDLC, where the cheapest way to deal with a bug is to fix it early, operational readiness requirements are cheaper to address at an earlier stage.
The main limitation of the current approach with the PRR model is that the service has already been developed, and the SRE engagement starts very late in the development lifecycle.
The current process is shown in the timeline below.
Proposal (high level view)
Establish a next iteration of the PRR model in which service development teams engage with the teams in charge of delivering and supporting a new service (Delivery, Reliability and Scalability, AppSec) earlier in the service development lifecycle, to have better opportunities to identify and fix potential issues. This proposal loosely follows the early SRE engagement model of the SRE Book. The earlier these teams are engaged (ideally during the design phase), the earlier decisions can be changed (e.g., when designing for a stateful service) and the implementation fixed, and the lower the effort and risk of rolling out the new service.
Even if SREs are involved later than the design stage but before the final stages, several aspects such as instrumentation, metrics, resource usage, and scaling up/down strategies can already be addressed.
Decisions on how to design for scalability, how to properly test a new service for non-functional requirements, which environment is best for the test, and which rollout strategy to use to expose the service to a controlled percentage of production traffic (so that the new component is evaluated under real-world conditions) can all be planned and introduced earlier. This gives a better understanding of the new service's behavior and allows any earlier decision to be revisited accordingly.
Process
The table below highlights the high-level process, with the actors, activities, and milestones expected to be completed in order to reach the Launch phase with a good and tested posture.
Phase | Actors Involved | Activity | Output and Milestones |
---|---|---|---|
Planning | Development Team | - Service is planned - Contact: | |
Design: this phase is composed of multiple iterations | Development Team | Service designing | |
| Infrastructure SREs: Delivery, Reliability, Scalability | | |
| App Sec | | |
| Infra Sec | | |
Implementation: this phase is composed of multiple iterations | Development Team | | |
| Infrastructure SREs: Delivery, Reliability, Scalability | | |
| App Sec | | |
| Infra Sec | | |
Launch: the launch phase marks the step when the Service moves to Operate mode. Development/maintenance of the Service continues. The Service will iteratively cycle through design and implementation phases for new iterations. These new iterations are subject to extra reviews and documentation updates when needed. | Development Team | | |
| Infrastructure SREs: Delivery, Reliability, Scalability | | |
| App Sec | | |
| Infra Sec | | |
Considerations
Currently, per our documentation around Alpha, Beta, Limited Availability, and Generally Available Features, a Production Readiness Review is only required for GA Features.
Services that are initially launched in Alpha, Beta, and Limited Availability bypass the PRR, and as a result must converge with Infrastructure/AppSec/InfraSec requirements at later stages. This could lead to sub-optimal onboarding and operational readiness when moving to General Availability.
Potentially, a lighter-touch early SRE engagement could also be part of Alpha, Beta, and Limited Availability services, to build a better production posture from an early stage.
Communication
How do we make all the teams aware of the new process once it is finalized and agreed upon?
- Handbook
- Update GitLab Docs
- Entry in "What’s happening at GitLab"
- Engineering-fyi document
- Infrastructure Group Conversation
Next Steps
- Contact all the involved stakeholders to collect feedback
- Adapt the proposal and process to the stakeholders' feedback
- Collect from the main actors a set of detailed milestones they expect to see accomplished, with a timeline
- Draft a formalized document with actors, points of contact, milestones to be achieved, and timelines that can be used for each new service