Commit 5206214d authored by Sam Wiskow

Update Scalability Group to Production Engineering Stage

parent 834c87e8
1 merge request: !136903 Update Scalability Group to Production Engineering Stage
......@@ -147,7 +147,7 @@
^[Product - SaaS Platforms Section]
/source/direction/saas-platforms/ @fzimmer @david @gl-product-leadership
/source/direction/saas-platforms/delivery/ @fzimmer @swiskow
/source/direction/saas-platforms/scalability/ @fzimmer @swiskow
/source/direction/saas-platforms/production-engineering/ @fzimmer @swiskow
/source/direction/saas-platforms/dedicated/ @fzimmer @cbalane
/source/direction/saas-platforms/switchboard/ @fzimmer @lbortins
 
......
......@@ -151,7 +151,7 @@ Deploying features can be very costly on production. We aim to make it simpler f
## Categories
 
- [Delivery](/direction/saas-platforms/delivery/)
- [Scalability](/direction/saas-platforms/scalability/)
- [Scalability](/direction/saas-platforms/production-engineering/)
- [GitLab Dedicated](/direction/saas-platforms/dedicated/)
- [Switchboard](/direction/saas-platforms/switchboard/)
 
......
---
layout: markdown_page
title: "Product Direction - Scalability"
description: "GitLab's Scalability group is responsible for making sure that GitLab runs at scale across our SaaS platforms"
title: "Product Direction - Production Engineering"
description: "GitLab's Production Engineering Stage is responsible for making sure that GitLab runs at scale across our SaaS platforms"
---
 
- TOC
{:toc}
 
## Overview
The Scalability group is responsible for GitLab at scale, working on the highest priority scaling items related to our SaaS platforms.
The group works in close coordination with the **Platform Engineering** teams.
We support other Engineering teams by sharing data and techniques so they can become better at scalability as well.
## Vision
 
As its name implies, the Scalability group enhances the **availability**, **reliability** and performance of GitLab's SaaS platforms by observing the application's capabilities to operate at scale.
The **Scalability group** analyzes application performance on GitLab's SaaS platforms, recognizes bottlenecks in service availability, proposes (and develops) short term improvements and develops long term plans that help drive the decisions of other Engineering teams.
The Production Engineering Stage is focused on two core pillars that provide value for our customers:
 
1. Building out paved roads that team members can follow to get software into production
2. Providing insight into & scaling the production operations that keep GitLab available and performant at scale.
This will position Production Engineering as a core engine or flywheel for growth at GitLab, enabling team members to get services to production quickly and reliably.
Runway will serve all product lines and be the best example of a paved road for new services at GitLab, making it easy to get them into customers' hands.
Other paved roads will emerge to make it easier for GitLab teams to create and work inside the monolith as well, including paved roads for observability, rate limiting and our edge network, which form part of our core services.
In addition, we will transform the operator experience by raising the bar on our SRE and incident management practices, providing new levels of insight and visibility to guide day 50 operations.
We will also take steps to re-architect parts of the application to improve resilience and usability, meaning that customers will be able to consume GitLab SaaS services with great user experience and availability.
Finally, comprehensive education programs delivered through GitLab's internal L&D platform will be available for all team members and required for certain roles, ensuring that everyone is able to create features that scale with our systems.
 
## Challenges
<!-- Optional section. What are our constraints? (team size, product maturity, lack of brand, GTM challenges, etc). What are our market/competitive challenges? -->
......@@ -30,14 +33,11 @@ Discoverability is also a significant challenge in the platform space.
It is vital that users of platform tools are able to quickly discover and implement shared tools and best practices.
If the tools are not flexible, easy to discover and easy to implement, they may hurt feature velocity rather than increase it.
 
The Scalability group can often become the owner of components and be responsible for maintaining them in an operational sense, for example [Redis](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/878).
This work can shift capacity away from other enabling tools that would make GitLab easier to scale and in an ideal world, Scalability would be able to handover tools that allow teams to maintain their own components.
Lastly, considering the image below, Scalability tends to operate on the right hand side of the graph, after changes are deployed to our SaaS platforms.
This can mean that our work is reactive in nature and we often treat symptoms of bad health in the platforms instead of root causes.
Shifting the scaling concerns left, earlier in the software development lifecycle, will help us to scale our SaaS platforms more efficiently.
 
Two examples where the Scalability Group has already shifted left is [Error Budgets](https://about.gitlab.com/handbook/engineering/error-budgets/) and [Capacity Planning](https://about.gitlab.com/handbook/engineering/infrastructure/capacity-planning/).
Two examples where the Production Engineering Stage has already shifted left are [Error Budgets](https://about.gitlab.com/handbook/engineering/error-budgets/) and [Capacity Planning](https://about.gitlab.com/handbook/engineering/infrastructure/capacity-planning/).
 
![alt text](shift_left.png "Shift Left Model")
 
......@@ -56,6 +56,15 @@ We envision “Observability Units” rolled out alongside our GitLab instances,
These “Observability Units” will be durable and capable of running independently without connection to the global stack.
The global stack will be eventually consistent with all of the units in the fleet, whilst preserving the data security and visibility requirements that we ensure today.
 
#### Single Pane of Glass that scales across GitLab use cases
As the complexity of GitLab's production systems grows, the observability of GitLab's platforms is critical.
In FY25, Dedicated and GitLab.com use different logging tools and have different operational solutions to their specific requirements.
This creates a lot of inefficiency in some of our most critical operations - error investigation, debugging and issue resolution.
Additionally, different product lines have different levels of isolation required and different access controls on operational data.
We will work to provide a single pane of glass that gives team members a low context entrypoint whilst handling the complex access and isolation requirements in the background.
This will make our observability platform easier to use and result in higher quality software that works well across product lines and not just for GitLab.com.
#### Leveraging cost effective managed service providers
 
As the number of GitLab instances grows, increasing the level of automation required to rollout and operate both units and the global observability stack will be paramount.
......@@ -69,6 +78,11 @@ We expect to continue to maintain some number of GitLab operated foundational se
GitLab’s [availability metrics](https://handbook.gitlab.com/handbook/engineering/infrastructure/performance-indicators/#gitlabcom-availability) represent our experience of running and operating GitLab.com at million user scale.
This has been successful so far, but as the number of capabilities and use cases grow on our platform, we want to shift these to better reflect the user experience, rather than the operator experience.
 
#### Instrumentation of core user journeys
In the future we will instrument end to end user journeys on the platform, which represent core user experiences that must be performant to drive an increase in customer satisfaction.
A side benefit of instrumenting user journeys will be the ability to utilise these SLIs and SLOs to evolve our testing strategy, shifting to observed metrics over fragile end to end tests and getting insight into performance that is only possible through observing production traffic.
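
To make the idea of a user-journey SLI concrete, here is a minimal sketch in Python. It is not GitLab's implementation: the `JourneyEvent` record, the journey name and the latency threshold are all hypothetical, and it assumes an SLI defined as the fraction of journey attempts that succeed within a latency threshold.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class JourneyEvent:
    """One observed attempt at a user journey (hypothetical record shape)."""
    journey: str
    succeeded: bool
    duration_seconds: float


def journey_sli(
    events: Iterable[JourneyEvent],
    journey: str,
    latency_threshold_seconds: float,
) -> Optional[float]:
    """Fraction of attempts that succeeded within the latency threshold.

    Returns None when no attempts were observed, so that "no traffic"
    is not mistaken for either perfect or zero availability.
    """
    relevant = [e for e in events if e.journey == journey]
    if not relevant:
        return None
    good = sum(
        1
        for e in relevant
        if e.succeeded and e.duration_seconds <= latency_threshold_seconds
    )
    return good / len(relevant)


def meets_slo(sli: Optional[float], slo_target: float) -> bool:
    """True when an SLI was measured and is at or above the SLO target."""
    return sli is not None and sli >= slo_target


# Hypothetical observations for an assumed "create_merge_request" journey.
events = [
    JourneyEvent("create_merge_request", True, 0.8),
    JourneyEvent("create_merge_request", True, 1.2),
    JourneyEvent("create_merge_request", True, 6.0),   # succeeded, but too slow
    JourneyEvent("create_merge_request", True, 0.5),
    JourneyEvent("push_commit", False, 0.3),           # different journey
]
sli = journey_sli(events, "create_merge_request", latency_threshold_seconds=2.0)
```

Because the SLI is computed from observed production traffic rather than a scripted test run, the same measurement can back both an SLO and a regression signal.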
#### Published availability metrics
 
As mentioned in other sections of this page, GitLab Dedicated and a cellular architecture increase the complexity of operating the SaaS solutions offered by GitLab.
......@@ -105,6 +119,11 @@ Documentation and processes should be clear and easy to find and follow.
Feature owners and service owners should be empowered to operate their features and services as far as reasonably possible.
Self-serve will be a core part of this, with simple interfaces to allow SREs to collaborate on issues where deep expertise is required.
 
#### Foundations
Foundational components like DNS records, WAF and rate limiting should be easy to configure and set up for GitLab teams.
Self-serve should mean that teams are able to reduce their direct interaction with SRE and move services to production faster.
### Solutions at GitLab Follow the Well Architected Services Framework
 
As part of the move to paved roads, we’ll create a Well Architected Services framework, which will tie in with our Service Maturity Model.
......@@ -121,13 +140,24 @@ We can improve this by providing a single pane of glass that empowers service ow
This should culminate in a Thinnest Viable Platform, where GitLab team members can discover vital information about their service(s)' health, key infrastructure performance indicators and other information that will contribute to the decision making process in feature development.
This will be composed of atomic tools and solutions, be customized to various GitLab roles and reduce cognitive load, increase discoverability & efficiency across GitLab.
 
### Incident management delivers best in class usability & insights
We have observed significant gaps in the capability of the native functionality that make it hard to perform analysis and generate insight from our incidents.
This makes it harder to prevent customer impacting incidents from re-occurring as well as to identify and address systemic issues.
Our incident management practice meets a high bar currently, but over time we will evolve our capability to deliver reporting, insight and analysis into the causes and frequency of incidents.
This will drive an increased level of visibility into problem areas at the team and also the executive level, meaning that we can make tactical and strategic prioritisation decisions about the work that teams do to improve the quality of the platform.
At the same time we will invest in new incident management platforms that make it easy to create and manage incidents, complete thorough and timely incident reviews and raise the bar for quality across GitLab.
Through greater insight and ability to capture more detail about incidents and corrective actions, we will be able to integrate into other parts of the organisation and develop "risk assessments" based on the pieces of the system we are changing and the frequency/severity of incidents related to changing those parts of the system.
This should allow us to streamline controls in areas of the product that are low risk and increase the controls and governance of the platform in higher risk areas, leading to a lower change failure rate across the board.
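
As an illustration only, the "risk assessment" idea above could be sketched as a score per component that combines incident frequency and severity. Everything here is an assumption rather than a described design: the `Incident` record, the severity weights, the component names and the threshold are hypothetical, and severities are assumed to be numbered 1 (most severe) to 4.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical weights: a more severe incident contributes more to the score.
SEVERITY_WEIGHTS = {1: 8, 2: 4, 3: 2, 4: 1}


@dataclass(frozen=True)
class Incident:
    """An incident attributed to a change in one part of the system."""
    component: str
    severity: int  # assumed scale: 1 = most severe, 4 = least severe


def risk_scores(incidents: Iterable[Incident]) -> Dict[str, int]:
    """Sum severity weights per component, so both frequency and severity count."""
    scores: Dict[str, int] = defaultdict(int)
    for incident in incidents:
        scores[incident.component] += SEVERITY_WEIGHTS[incident.severity]
    return dict(scores)


def control_level(score: int, threshold: int = 8) -> str:
    """Map a score to a level of change control (threshold is an assumption)."""
    return "strict" if score >= threshold else "streamlined"


# Hypothetical incident history.
history = [
    Incident("redis", 1),
    Incident("redis", 3),
    Incident("docs-site", 4),
]
scores = risk_scores(history)
levels = {component: control_level(score) for component, score in scores.items()}
```

A component with no recorded incidents has no entry in `scores`; treating it as a score of 0 (for example `scores.get(name, 0)`) would place it in the streamlined category.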
## 1-Year Plus Plan
 
<!-- Describe key themes, projects, and/or features planned over the next year. Also highlight what we will not be doing in the next year -->
After the expansion of the Scalability Group and to stop this page becoming too long, we have broken out the one year plan into two new pages:
 
- [Observability Direction](/direction/saas-platforms/scalability/observability)
- [Practices Direction](/direction/saas-platforms/scalability/practices)
- [Observability Direction](/direction/saas-platforms/production-engineering/observability)
- [Practices Direction](/direction/saas-platforms/production-engineering/practices)
 
<!-- These pages ^^ still need to be created at the time of this MR, remove this comment when they are created -->
 
......