Discussion: Introduce an 'Infrastructure Well-Architected Service Framework'
Problem
We have seen an increase in new services in the GitLab ecosystem, and we expect this trend to continue. It is currently difficult for service owners to know what is expected of their services from an Infrastructure perspective. To give just a few examples, the following questions are currently difficult to answer (or service owners may not even know they should be asking them):
- What are the scaling parameters of my service? Will it scale on GitLab.com?
- What are the availability expectations of my service?
- Is Disaster Recovery an important consideration?
- Is there a specific logging format I should use?
- Are there any special requirements to secure customer data?
- How do I know how much it costs to operate my service?
Infrastructure Well-Architected Service Framework
Architecture frameworks are published by the big cloud providers (AWS, Azure, GCP) to recommend design principles and best practices for running software in their respective ecosystems. They all commonly refer to the same 5 'Pillars':
| Pillar | Description |
| --- | --- |
| Reliability | The ability of a system to recover from failures and continue to function. |
| Security | Protecting applications and data from threats. |
| Cost optimization | Managing costs to maximize the value delivered. |
| Operational excellence | Operations processes that keep a system running in production. |
| Performance efficiency | The ability of a system to adapt to changes in load. |
This issue proposes creating a similar, but lightweight, framework that provides guidance for running services on GitLab's platforms. It would serve as the entry point for learning how to run services at scale.
Doesn't a Readiness Review solve this problem?
Our readiness review process is evolving in response to the growing number of new services. However, it is not a single source of truth for what good looks like across all new services. It is a point-in-time review that signs a service off for production use, not an ongoing, continuously improving process for that service. The readiness review process could reference the Infrastructure Well-Architected Service Framework. Further, service owners will be encouraged to review the framework before any potential readiness review, hopefully resulting in fewer issues being raised at that stage.
How does this overlap with the Service Maturity Model?
The Service Maturity Model is a list of all operational services, along with a set of criteria that make up 'levels'. It is effectively a checklist of useful tools and processes that can be used to help operate services at scale. It doesn't serve as a knowledge base or educational tool; that is, it doesn't explain why or how the tools and processes should be implemented.
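To make the "criteria that make up levels" structure concrete, here is a minimal sketch of how such a checklist could be evaluated. It is not the actual Service Maturity Model implementation: the criterion names, the `Service` class, and the assumption that levels are cumulative (a level only counts once every lower level is met) are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical criteria grouped by level; the real model defines its own.
LEVEL_CRITERIA = {
    1: {"has_service_catalog_entry", "has_metrics"},
    2: {"has_slo_monitoring", "has_runbook"},
    3: {"has_capacity_planning"},
}


@dataclass
class Service:
    name: str
    # Names of the criteria this service currently satisfies.
    satisfied: set = field(default_factory=set)


def maturity_level(service: Service) -> int:
    """Return the highest level whose criteria, and those of all lower levels, are met.

    Assumes levels are cumulative; returns 0 if level 1 is not met.
    """
    level = 0
    for candidate in sorted(LEVEL_CRITERIA):
        if not LEVEL_CRITERIA[candidate] <= service.satisfied:
            break
        level = candidate
    return level


def missing_criteria(service: Service, level: int) -> list:
    """List the criteria still outstanding for the given level, sorted by name."""
    return sorted(LEVEL_CRITERIA[level] - service.satisfied)


if __name__ == "__main__":
    svc = Service(
        "example-service",
        {"has_service_catalog_entry", "has_metrics", "has_runbook"},
    )
    print(maturity_level(svc))
    print(missing_criteria(svc, 2))
```

A checklist like this can report which level a service has reached and what is outstanding, but it carries no explanation of why a criterion matters or how to satisfy it. That is the gap the proposed framework is meant to fill.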
How does this fit in with Runway?
The Runway Platform delivers a simple way to implement the Infrastructure Well-Architected Service Framework. That is, Runway aims to provide the tools, best practices and processes described by the Framework 'for free' to all services deployed through it.
References