Scalability Practice (!53326) · Merge requests · GitLab.com / www-gitlab-com

Gerardo Lopez-Fernandez requested to merge gerir/engineering/architecture/scalability into master Jun 18, 2020

Why is this change being made?

In 2019, GitLab.com experienced prolonged and significant instability, primarily due to scalability limits on some critical components. Self-managed GitLab also faced some issues of its own. We have since worked diligently towards addressing scalability broadly, and a lot of this work has focused on the organization:

Rapid Action Issue and Development Escalation Process
The Scalability Team in the Infrastructure Department
The Database Team in the Development Department
The Performance and Availability Board
Reference architectures
Extensive work in observability and service levels

These organizational changes enabled us to focus, collaborate, and solve a number of critical problems, such as those that centered around Redis, Sidekiq, and PgBouncer, among others.

We also learned two important lessons:

Scalability is a shared concern, so we aligned the organization to that effect
Timing is key, so we made a significant investment in observability
- Too early will generally lead to premature optimization
- Too late will threaten availability, and, by extension, the business

From those experiences, one of our greatest and most obvious scalability concerns has been the “database” (i.e., Postgres, though we should likely start thinking about a name that does not reflect the actual database product). We know, intuitively, that it will eventually run into issues, so earlier this year the Database Sharding Working Group was formed.

Scaling Postgres

We know that the database will not scale up indefinitely, so we have been exploring solutions involving database sharding to scale it out. There is one formal proposal to shard by tenant:

Tenant sharding

Additionally, there's extensive analysis that explores the potential of sharding by root namespace:

https://about.gitlab.com/handbook/engineering/development/enablement/database/doc/root-namespace-sharding.html

Both approaches are backed by relevant data analysis, and they both reflect a desire to find a visible, high-level, easy-to-grasp splitting variable to use in the creation of logical groupings to shard the database, especially within the context of our values of Iteration (MVC) and Efficiency (Boring solutions).
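To make the idea of a "splitting variable" concrete, here is a minimal sketch of how a row could be routed to a shard based on one chosen key. Everything in it is an assumption for illustration only: the shard count, the `Record` fields, and the hash-modulo scheme are hypothetical and are not taken from either proposal.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical number of shards, chosen only for this illustration.
NUM_SHARDS = 4


@dataclass(frozen=True)
class Record:
    """A hypothetical row carrying both candidate splitting variables."""
    tenant_id: int
    root_namespace_id: int


def shard_for(key: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a splitting-variable value to a shard index.

    Uses a stable hash (SHA-256) so the mapping is the same across
    processes and restarts, then reduces it modulo the shard count.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_by_tenant(record: Record) -> int:
    """Route using the tenant as the splitting variable."""
    return shard_for(record.tenant_id)


def shard_by_root_namespace(record: Record) -> int:
    """Route using the root namespace as the splitting variable."""
    return shard_for(record.root_namespace_id)


if __name__ == "__main__":
    row = Record(tenant_id=7, root_namespace_id=9970)
    print("by tenant:", shard_by_tenant(row))
    print("by root namespace:", shard_by_root_namespace(row))
```

Note that with a plain modulo scheme like this one, changing the number of shards remaps most keys, which is one way the data-migration cost of later resharding shows up.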

And yet, arguably, there is nothing boring about sharding, especially about database sharding. The database is such a foundational component that even seemingly small changes can have major detrimental ripple effects across the board. The database is everything, its criticality far surpassing that of just about any other component, given the very stringent performance and durability requirements, coupled with the intrinsic complexity of the relationships it holds.

Database sharding is one of the hardest and most sensitive scalability problems we need to solve, bordering on a rite of passage for growing and maturing the environment at scale. Every decision we make has long-lasting effects, and every iteration will naturally limit the space of future options. Furthermore, as time goes by, the dataset size and the relationships built into it will inevitably increase, making future sharding maneuvers more difficult to execute, as they’re likely to require data migrations.

Scalability is a Practice, and a Strategic One at That

We are now facing a third, critical lesson: scalability is a strategic practice. We are entering the realm of at scale proper. As an analogy, we are approaching the threshold that separates Newtonian physics from Quantum physics.

Which one of the two sharding proposals should we choose? They’re both sensible, they’re both technically correct, they’re both right, and they’re both wrong: that one option is complex and difficult to implement while the other is better aligned with our values does not build a strong case for selecting one over the other. A third option, for which we can find a large number of well-known examples, could propose a service-oriented split. A fourth one might toss Postgres altogether and select a different backend.

Which is to say: we lack a framework to evaluate them in context, and we lack best-practice guidelines to lead our decisions.

Author Checklist

Provided a concise title for the MR
Added a description to this MR explaining the reasons for the proposed change, per say-why-not-just-what
Assign this change to the correct DRI
- If the DRI for the page/s being updated isn’t immediately clear, then assign it to your manager.
- If your manager does not have merge rights, please ask someone to merge it AFTER it has been approved by your manager in #mr-buddies.
- If the changes relate to any part of the project other than updates to content and/or data files please make sure to ping @gl-static-site-editor in a comment for a review and merge. For example changes to .gitlab-ci.yml, JavaScript/CSS/Ruby code or the layout files.

For help with failing pipelines reach out in #mr-buddies in Slack

Edited Jun 22, 2020 by Gerardo Lopez-Fernandez
