Proposal: split GitLab monolith into components

Problems

The problems indicated in this issue are largely inspired by @ayufan's work in #31121.

The gitlab-org/gitlab Rails monolith has over 2.2 million lines of Ruby code and hundreds of engineers make hundreds of changes every day.

The monolith is undergoing a database decomposition in which the CI database is being split from the main database. This is a first step toward componentizing the monolith. However, the architecture pattern we still use today, and its directory structure, is largely Model-View-Controller applied to the whole app. While this works well early on, enterprise Rails applications at scale can't be sustainably managed using the MVC layered architecture.

Today we are experiencing a series of problems categorized below.

Engineering

Onboarding engineers is a very slow process. It takes a while before someone feels productive, due to the size of the context and the lack of isolation, which introduces a lot of caveats.
For engineers (even experts) it's impossible to build a mental map of the application due to its size. Even apparently confined changes can have far-reaching and catastrophic repercussions on other parts of the monolith.
Shipping features gets more complex and costly due to the high degree of coupling.
It becomes harder for engineers to become maintainers due to ever-growing complexity (gitlab-com/www-gitlab-com#13312 - closed).

Architecture

No clear domain structure. We have forced the creation of some modules but have no company-wide strategy.
No degree of isolation between existing modules.
We don't write good abstractions or use well-documented and stable internal APIs. Instead, we tend to access concrete classes that may be private domain concepts. This causes an explosion of complexity.
Refactorings are getting harder and riskier due to the interconnected codebase. This also translates into a higher engineering effort to solve technical debt.
The majority of the code is not namespaced and is unorganized (e.g. the lib/ folder).
Moving stable parts of the application into separate services is impossible due to high coupling.
The architecture hasn't evolved although the product and company have changed significantly over the years.
Rails does not provide tools to effectively enforce boundaries. Everything lives in the same memory space.

Productivity

Longer cycle-time due to higher cognitive load in development and reviews.
It can be overwhelming for the wider community to contribute to our codebase.
As parts of the code are not isolated, we need to test the whole monolith all the time. This results in ever-growing pipelines.
No way to partially test the codebase.

Proposal

TL;DR: Change the Rails monolith from a big ball of mud into a modular monolith. Extract cohesive functional domains (components) into a separate directory structure using Domain-Driven Design practices. Extract a platform component for truly cross-cutting concerns and optionally separate the web layer into a Rails engine or web component. Use Packwerk to enforce privacy and dependency boundaries between components.

PoC: Draft: PoC - extract CI into component (!88899 - closed)

Break down the application into components:
- split domain code into separate components (ci, merge_requests, workspaces, packages, ...). Initially as simple Packwerk packages.
- platform component: the supporting layer that most of the domain components depend on. Includes tools such as loggers, database/redis utils, exclusive locks, base classes (ApplicationRecord, ApplicationWorker, ...), etc. This is the boilerplate/toolbox code that a component needs in order to run. This should not contain any form of domain code, only tooling.
- web: the web layer of our application (controllers, views, REST API, GraphQL, authentication). This could be a Rails engine if we need to load it conditionally (e.g. for Sidekiq deployments); otherwise each part could go into the dedicated domain component.
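
To make the split above more concrete, here is a minimal sketch of how files could be assigned to components with a rule table. The patterns, the component names beyond those listed above, and the helper names `component_for` and `categorize` are all hypothetical; the real GitLab directory layout is more varied than this.

```ruby
# Hypothetical rule table: the first matching pattern wins.
# Paths and patterns are illustrative, not the real GitLab layout.
COMPONENT_RULES = [
  [%r{\A(app/\w+|lib/gitlab)/ci/},         :ci],
  [%r{\Aapp/\w+/packages/},                :packages],
  [%r{\Aapp/(controllers|views|graphql)/}, :web],
  [%r{\Alib/api/},                         :web],
  [%r{\Alib/gitlab/},                      :platform]
].freeze

# Returns the component for a single file path,
# or :uncategorized when no rule matches.
def component_for(path)
  rule = COMPONENT_RULES.find { |pattern, _component| pattern.match?(path) }
  rule ? rule.last : :uncategorized
end

# Groups a list of file paths by component.
def categorize(paths)
  paths.group_by { |path| component_for(path) }
end
```

Because the domain rules come first, a file such as `app/controllers/ci/...` lands in `ci` rather than `web`, which corresponds to the option of placing each web part in its dedicated domain component.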

Dependency diagram

flowchart LR

subgraph Rails
direction LR
web([web]) -- depends on --> merge_requests -- depends on --> platform
web --> ci --> platform
web --> repositories --> platform
merge_requests --> ci
ci --> repositories
merge_requests --> repositories
folders[doc, db, config, qa, locale, ...]
end

ci --exclusively uses --> cidb[CI database]
Rails --> maindb[Main database]
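
The edges in the diagram above can also be written down as plain data and checked mechanically. The following is a minimal sketch that uses Ruby's standard `tsort` library to confirm the graph has no cycles; the constant `DEPENDENCIES` and the helpers `dependency_order` and `allowed_dependency?` are hypothetical names introduced only for illustration.

```ruby
require "tsort"

# Edges copied from the diagram: component => components it depends on.
DEPENDENCIES = {
  web:            %i[merge_requests ci repositories],
  merge_requests: %i[ci repositories platform],
  ci:             %i[repositories platform],
  repositories:   %i[platform],
  platform:       []
}.freeze

# Returns the components ordered so that each one appears after
# everything it depends on. Raises TSort::Cyclic if there is a cycle.
def dependency_order(graph)
  each_node  = ->(&block) { graph.each_key(&block) }
  each_child = ->(node, &block) { graph.fetch(node).each(&block) }
  TSort.tsort(each_node, each_child)
end

# True when `from` explicitly declares a dependency on `to`.
def allowed_dependency?(graph, from, to)
  graph.fetch(from).include?(to)
end
```

With the edges above, `dependency_order(DEPENDENCIES)` starts with `:platform` and ends with `:web`, and `allowed_dependency?(DEPENDENCIES, :platform, :ci)` is false, which reflects the rule that the platform must not depend on domain code.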

Hexagonal architecture + modular monolith

Define and enforce explicit boundaries.
- Have a well-documented and stable internal API (for example, service objects that interface with a bounded context). Domain A should interact with domain B through this interface rather than referencing internal constants.
Refactor the code with a data-driven approach based on the dependency graph and reports of violations.
- Use stable internal APIs for communication between components (driven by privacy and dependency violations).
- Use Gitlab::EventStore as a tool for dependency inversion (driven by the dependency graph and dependency violations).
Run Packwerk static analyzer in CI (as we do with Rubocop) to catch violations.
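
As an illustration of dependency inversion through events, here is a minimal sketch built on a toy in-process store. `ToyEventStore` is not the real Gitlab::EventStore API, and the event and subscriber classes are hypothetical. The point is only that `ci` publishes an event without knowing who consumes it, while `merge_requests` (which already depends on `ci` in the diagram) subscribes to it.

```ruby
# Toy in-process event store. This is NOT the Gitlab::EventStore API,
# only a stand-in to show the direction of the dependencies.
class ToyEventStore
  def initialize
    @subscribers = Hash.new { |hash, event_class| hash[event_class] = [] }
  end

  # Registers a handler block for one event class.
  def subscribe(event_class, &handler)
    @subscribers[event_class] << handler
  end

  # Calls every handler registered for the event's class.
  def publish(event)
    @subscribers[event.class].each { |handler| handler.call(event) }
  end
end

module Ci
  # Hypothetical event published by the ci component.
  PipelineFinishedEvent = Struct.new(:pipeline_id, :status, keyword_init: true)
end

module MergeRequests
  # Hypothetical subscriber: merge_requests knows about the ci event,
  # but ci never references anything in merge_requests.
  class PipelineStatusTracker
    attr_reader :statuses

    def initialize(store)
      @statuses = {}
      store.subscribe(Ci::PipelineFinishedEvent) do |event|
        @statuses[event.pipeline_id] = event.status
      end
    end
  end
end
```

Usage would look like creating one store, passing it to `MergeRequests::PipelineStatusTracker.new(store)`, and then calling `store.publish(Ci::PipelineFinishedEvent.new(pipeline_id: 1, status: "success"))` from the `ci` side.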

Code reorganization without proper encapsulation doesn't make much of a difference, since all classes are public in Ruby. This is where Packwerk makes the difference: it allows us to put cohesive code into the same package while encapsulating the details, so that other components depend only on stable abstractions. With Packwerk, all constants can be made private by default.
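
Packwerk enforces privacy through static analysis rather than at runtime. As a rough runtime analogue of the same idea, the sketch below uses Ruby's built-in `private_constant` to hide an internal class behind a small public entry point. The names `PipelineBuilder`, `PublicApi`, and `create_pipeline` are hypothetical and do not correspond to real GitLab classes.

```ruby
module Ci
  # Internal detail of the component.
  class PipelineBuilder
    def initialize(ref)
      @ref = ref
    end

    def build
      { ref: @ref, status: :created }
    end
  end
  # Referencing Ci::PipelineBuilder from outside now raises NameError.
  private_constant :PipelineBuilder

  # Hypothetical stable interface that other components are meant to call.
  module PublicApi
    def self.create_pipeline(ref:)
      # The unqualified reference still resolves inside the Ci namespace.
      PipelineBuilder.new(ref).build
    end
  end
end
```

Other components would call `Ci::PublicApi.create_pipeline(ref: "main")`. The difference with Packwerk is that its violations are reported by a checker (for example in CI) and can be recorded and fixed incrementally, instead of failing at runtime.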

Advantages

Having clear and enforced boundaries (privacy and dependency) between components allows us to change a component's internals without affecting the outside world.
Explicit dependencies. We can automatically generate and document our dependency graph for visual representation. With explicit dependencies, we know what can be affected by our changes.
Such a componentized architecture could make further decomposition of GitLab's database easier than our large effort to decompose the CI database out of main. See Lessons learned from CI decomposition (#361484 - closed).
Packwerk can also be used to define sub-components and encapsulate details further. This prevents each component from becoming a small ball of mud as the code grows.
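
As a small illustration of generating the dependency graph automatically, the sketch below renders a hash of declared dependencies as Mermaid flowchart source. The helper name `to_mermaid` and the input shape (component => list of dependencies) are assumptions for this example; in practice the data would come from the package definitions.

```ruby
# Renders a dependency hash (component => dependencies) as Mermaid
# flowchart source, one edge per line.
def to_mermaid(graph)
  lines = ["flowchart LR"]
  graph.each do |component, dependencies|
    dependencies.each do |dependency|
      lines << "  #{component} --> #{dependency}"
    end
  end
  lines.join("\n")
end

puts to_mermaid(ci: %i[repositories platform], repositories: %i[platform], platform: [])
```

Components with no dependencies only appear as edge targets in this sketch; a fuller version would also emit isolated nodes.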

Expected results

Shorter onboarding process and lower barrier for community contributions.
Faster feedback loop during development due to boundary violations being found early; faster CI due to testing only the affected parts.
Safer changes and faster dev cycle due to clear contracts/interfaces between components and enforced boundaries.
Easier to separate components further such as database decomposition, once we have explicit boundaries.
Better evolutionary architecture of each component and overall, due to encapsulation and to a better understanding of the dependencies.
Improved team agility. With decoupled components, refactorings are mostly isolated from the external dependencies so they are safer and faster. Faster changes due to limited scope.
With components, the code will better represent the product. Today many functional components (feature categories) don't have explicitly defined boundaries.
For every MR we could test only the affected component and its dependencies, further reducing the CI feedback cycle. This improves team agility and also benefits community contributors, since they will consume fewer CI minutes and could contribute more in a month.
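
The selective testing idea can be sketched as a graph walk. The example below reads "affected" as the changed component plus every component that directly or transitively depends on it; that reading, the constant `DEPENDS_ON`, and the helper `components_to_test` are assumptions made for illustration, with edges taken from the dependency diagram.

```ruby
require "set"

# component => components it depends on (edges from the dependency diagram).
DEPENDS_ON = {
  web:            %i[merge_requests ci repositories],
  merge_requests: %i[ci repositories platform],
  ci:             %i[repositories platform],
  repositories:   %i[platform],
  platform:       []
}.freeze

# Returns the changed component plus every component that depends on it,
# directly or transitively (breadth-first walk over reversed edges).
def components_to_test(graph, changed)
  affected = Set[changed]
  queue = [changed]
  until queue.empty?
    current = queue.shift
    graph.each do |component, dependencies|
      next unless dependencies.include?(current)

      # Set#add? returns nil when the component was already recorded.
      queue << component if affected.add?(component)
    end
  end
  affected
end
```

Under these assumptions, a change in `merge_requests` would select only `merge_requests` and `web`, while a change in `platform` would still select every component, which is one reason to keep the platform small and stable.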

Challenges

Such changes require a shift in the development mindset in order to understand the benefits of the modular architecture and not fall back into legacy practices.
Changing the application architecture is neither an easy nor a short task. It takes time, resources and commitment, but most importantly it requires buy-in from engineers.
This may require a medium- to long-term team of engineers (a Working Group or Special Interest Group) that makes progress on the architecture evolution plan, fosters discussions in various engineering channels, and resolves adoption challenges.
We need to ensure we build standards and guidelines and not silos.
We need to ensure we have clear guidelines on where new code should be placed. We must not recreate dropbag folders like lib/.

PoC

I experimented with this approach in Draft: PoC - extract CI into component (!88899 - closed), where I mostly prepared the application to support multiple components and started moving CI code into a component. However, a new PoC is also being evaluated in Draft: POC - Use Packwerk to create `components... (!98801 - closed), which allows a more incremental approach with minimal Packwerk violations.

Iteration plan

Start by listing all the Ruby files in a spreadsheet and categorizing them into components following the Domain-Driven Design principles of bounded contexts. Some of them are already pretty explicit, like Ci::, Packages::, etc. Components should follow our existing naming guide. This should probably be a Working Group with representative members of each devops stage (e.g. Senior+ engineers). The WG would help define high-level components, and its members would be the DRIs for driving the changes in their respective devops stage.
Create an Architecture Evolution Blueprint to describe this iteration plan and the long term vision.
We can start by extracting the ci domain into a component (a Packwerk package inside the main app). Given the work on decomposing the CI database out of the main database, and the fact that the Ci:: and Gitlab::Ci:: namespaces are among the oldest and most well-defined namespaces in the codebase, it makes sense to start with that.
- Changes should be done in iterations: (1) create an empty component, (2) move files into the component, (3) disable privacy and dependency Packwerk rules so we don't raise violations, (4) gradually make constants private and refine/fix the design.
- Domain logic in lib/ must be moved inside the component directory.
- In parallel to the iteration plan we need to work off the list of the recorded Packwerk violations (if any) to improve the architecture. These recorded violations are like Rubocop TODO files.
Extract the other components identified in Step 1, following the same instructions as for the ci component.
- In parallel to the iteration plan we need to work off the list of the recorded violations to improve the architecture.
Extract all the cross-cutting concerns code in lib/ into a platform component. Make existing components depend on it. Use Packwerk to record violations.
- In parallel to the iteration plan we need to work off the list of the recorded violations to improve the architecture.
We could extract the web component into a Rails engine. We have a PoC that verified its feasibility and the immediate improvement it can bring by having Sidekiq servers not load the web component.
- Use Packwerk around the Rails engine to guard against privacy and dependency violations. Record existing violations and fix them in iterations.
[optional] As a domain component becomes well decoupled and its dependencies well defined, we could extract it into a Rails engine and set dependencies such as: web --> domain engine X --> platform. However, Rails engines require strong isolation and dependency management. This is why Packwerk is more flexible: it allows us to iterate without the need to use engines. In fact, we may not need engines at all.
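
Working off recorded violations, as the steps above describe, is easier with a simple tally. The sketch below parses a small YAML sample and counts violating files per violation type. The sample data is invented, and its shape (package, then constant, then `violations` and `files` lists) is an assumption modeled on Packwerk's recorded-violation files; check it against the Packwerk version in use. The helper name `violation_counts` is hypothetical.

```ruby
require "yaml"

# Invented sample. The structure is an assumption about how recorded
# violations are stored: package => constant => violations/files.
SAMPLE_TODO = <<~YAML
  components/ci:
    "::Ci::Pipeline":
      violations:
      - privacy
      files:
      - app/models/project.rb
      - app/services/projects/destroy_service.rb
    "::Ci::Build":
      violations:
      - privacy
      - dependency
      files:
      - app/models/user.rb
YAML

# Counts, per violation type, how many referencing files are recorded.
def violation_counts(todo)
  counts = Hash.new(0)
  todo.each_value do |constants|
    constants.each_value do |entry|
      entry.fetch("violations").each do |type|
        counts[type] += entry.fetch("files").size
      end
    end
  end
  counts
end

puts violation_counts(YAML.safe_load(SAMPLE_TODO))
```

Tracking these counts over time would give a rough measure of progress, in the same spirit as shrinking a Rubocop TODO file.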

Edited Sep 23, 2022 by Fabio Pitino