Scalability Practice

Why is this change being made?

In 2019, GitLab.com experienced prolonged and significant instability, primarily due to scalability limits on some critical components. Self-managed GitLab also faced some issues of its own. We have since worked diligently towards addressing scalability broadly, and a lot of this work has focused on the organization:

These organizational changes enabled us to focus, collaborate, and solve a number of critical problems, such as those centered around Redis, Sidekiq, and PgBouncer, among others.

We also learned two important lessons:

  • Scalability is a shared concern, so we aligned the organization to that effect
  • Timing is key, so we made a significant investment in observability
    • Too early will generally lead to premature optimization
    • Too late will threaten availability, and, by extension, the business

From those experiences, one of the greatest and most obvious scalability concerns we have had is the “database” (i.e., Postgres, though we should likely start thinking about a name that does not reflect the actual database product). We know, intuitively, that it will eventually run into issues, so earlier this year the Database Sharding Working Group was formed.

Scaling Postgres

We know that the database will not scale up indefinitely, so we have been exploring solutions involving database sharding to scale it out. There is one formal proposal to shard by tenant:

Additionally, there's extensive analysis that explores the potential of sharding by root namespace:

Both approaches are backed by relevant data analysis, and they both reflect a desire to find a visible, high-level, easy-to-grasp splitting variable to use in the creation of logical groupings to shard the database, especially within the context of our values of Iteration (MVC) and Efficiency (Boring solutions).
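To make the idea of a "splitting variable" concrete, here is a minimal sketch of routing a record to a logical shard by a chosen key. It does not reproduce either proposal: the names (`Record`, `shard_for`, `NUM_SHARDS`), the shard count, and the use of hash-based placement are all assumptions made for illustration. A real design might instead use an explicit lookup table from tenant or namespace to shard.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical shard count, chosen only for this illustration.
NUM_SHARDS = 4


@dataclass(frozen=True)
class Record:
    """Hypothetical row carrying both candidate splitting variables."""
    tenant_id: int
    root_namespace_id: int


def shard_for(key: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a splitting-variable value to a shard index.

    Uses a stable hash (SHA-256) so the same key always lands on the
    same shard across processes and restarts.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_by_tenant(record: Record) -> int:
    """Route using the tenant as the splitting variable."""
    return shard_for(record.tenant_id)


def shard_by_root_namespace(record: Record) -> int:
    """Route using the root namespace as the splitting variable."""
    return shard_for(record.root_namespace_id)


if __name__ == "__main__":
    rec = Record(tenant_id=7, root_namespace_id=9001)
    print("by tenant:", shard_by_tenant(rec))
    print("by root namespace:", shard_by_root_namespace(rec))
```

The two routing functions differ only in which field they read, which is the point: the choice of splitting variable decides which rows end up co-located, and therefore which queries stay on a single shard.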

And yet, arguably, there is nothing boring about sharding, especially database sharding. The database is such a foundational component that even seemingly small changes can have major detrimental ripple effects across the board. The database is everything: its criticality far surpasses that of just about any other component, given the very stringent performance and durability requirements, coupled with the intrinsic complexity of the relationships it holds.

Database sharding is one of the hardest and most sensitive scalability problems we need to solve, bordering on a rite of passage into growing and maturing the environment at scale. Every decision we make has long-lasting effects, and every iteration will naturally limit the space of future options. Furthermore, as time goes by, the dataset size and the relationships built into it will inevitably increase, making future sharding maneuvers harder to execute, as they’re likely to require data migrations.

Scalability is a Practice, and a Strategic One at That

We are now facing a third, critical lesson: scalability is a strategic practice. We are entering the realm of operating at scale proper. As an analogy, we are approaching the threshold that separates Newtonian physics from quantum physics.

Which one of the two sharding proposals should we choose? They’re both sensible, they’re both technically correct, they’re both right, and they’re both wrong: that one option is complex and difficult to implement while the other is better aligned with our values does not build a strong case for selecting one over the other. A third option, for which we can find a large number of well-known examples, could propose a service-oriented split. A fourth one might toss Postgres altogether and select a different backend.

Which is to say: we lack a framework to evaluate them in context, and we lack best-practice guidelines to lead our decisions.

Author Checklist

  • Provided a concise title for the MR
  • Added a description to this MR explaining the reasons for the proposed change, per say-why-not-just-what
  • Assigned this change to the correct DRI
    • If the DRI for the page/s being updated isn’t immediately clear, then assign it to your manager.
    • If your manager does not have merge rights, please ask someone to merge it AFTER it has been approved by your manager in #mr-buddies.
    • If the changes relate to any part of the project other than updates to content and/or data files, please make sure to ping @gl-static-site-editor in a comment for a review and merge. For example, changes to .gitlab-ci.yml, JavaScript/CSS/Ruby code, or the layout files.

For help with failing pipelines, reach out in #mr-buddies in Slack.

Edited by Gerardo Lopez-Fernandez
