META - RFC - Notes on Sharding, GitLab Scaling, Availability and related issues

RFC - Request for Comments

Invitation to link related issues for Sharding and GitLab architecture.

  • Sharding
  • DR
  • HA
  • Deduplication
  • Multi-site
  • Locking data hosting to specific geographic regions
  • User-TZ
  • User-Locale
  • Containerization
  • Orchestration
  • Global Ingress - Anycast
  • Blackbox integration test

Observations after deploying both small and global instances of GitLab:

GitLab has the GitLab EE and GitLab CE products and a team with a proven track record of providing highly available single-node solutions (> 6 9’s when subtracting the effect of deployments/upgrades). On the other hand, availability has not proven to be better than two nines when the roles/functions within a GitLab instance are spread over a cluster of nodes.

  1. As a result, if a "shard" is defined to be a full GitLab stack from ingress to persistent storage, a cluster of such "shards" can be highly reliable because of the inherent bulkheads between shards.
    1. This is “boring”: simpler and easier to maintain than having a separate technology stack for sharding EACH of the persistent storage engines (Redis, PostgreSQL, POSIX (NFS)), each of which has its own scaling challenges.
    2. It also allows different shards to operate at different rev-levels of the code-base (enhancing the ability to run canaries, A/B feature testing, etc.).
  2. Sharding — DR/HA/Failover should be done at the level of a shard as a full instance of GitLab. The shards should be small enough to provide rapid backup/restore/cloning.
  3. Shards should be mirrored remotely, with failover and parallel read capability.
  4. Corollary: there should be enough shards to host the whole of gitlab.com without exceeding the per-shard size limits described above.
    1. If gitlab.com is 200 TB (heading to 1 PB based on sales goals), 1,000 shards would yield 200 GB (soon to be 1 TB) per shard.
    2. 1,000 simple single-instance shards would also have lower OPEX than the current gitlab.com HA clusters supporting NFS / PostgreSQL / Redis, and would be much easier to maintain.
    3. SRE: a large number of shards greatly enhances the ability to deliver a 5 9’s or better SLA to end users, as the risk of component failure is confined to a small subset of the entire user base via the inherent bulkheading between shards.
    4. As the system grows, the number of shards should grow accordingly; the easiest growth strategy is to split an existing shard by hash range on namespace/repo path (see the hash-range routing sketch after this list).
  5. Shards should each be a “collection of repositories” to allow complete user-visible functionality in the event of component failures within a single shard.
    1. system availability is greatly improved
      1. the current GitLab.com system (Sept 2017) is a single instance for all users; DR impact is (time-to-restore * number of affected users), and since time-to-restore is itself linear in the number of users/repositories, total impact is O(N^2) relative to the number of users.
      2. technology stacks are BORING - less complexity than horizontal clusters
        1. since DR is handled by mirror shards of an entire instance, there is no real need for Redis/Sentinel, Gitaly, or Pgpool/PostgreSQL as separate nodes.
        2. no need for a name lookup/DNS resolver to reach most of the related services in the GitLab stack, since they can connect over local Unix sockets.
  6. User management/identity/authentication is a concern orthogonal to sharding by groups of repositories (separation of concerns at the architecture level).
    1. Short term, the identity/authorization data of all users can be synchronized onto all shards.
    2. One of the most costly queries (frequency * cost in CPU and I/O) is “create the list of projects this user is authorized to see”, and it is currently doing table scans on the projects table. Use another strategy, such as a separate denormalized table, index scans, or a Map/Reduce cluster instead of the relational DB (see the authorization-table sketch after this list).
  7. Migration is to be provided at the project/repo level, i.e. move a project/repo from ShardA to ShardB; this supports the initial distribution of data from the single Azure cluster of GitLab.com.
  8. Region Locking - a common customer requirement is restricting the physical location of their data to a particular geographic/political region for privacy, compliance, and latency purposes.
    1. The repo will be assigned to a machine instance in the appropriate location.
  9. User connections can be sent to the correct shard either by routing from the ingress nodes or by redirecting the user for HTTPS sessions (see the ingress redirect sketch after this list).
  10. Deduplication
    1. git packs are not conducive to global deduplication. Git is already an object store with individual hashed object paths — combining those into bundles (packs) hides necessary information that would allow the same object in forked/cloned repositories to share the same storage bytes/instance as the original upstream repo.
  11. Multi-site
    1. shards, as full GitLab instances, can be hosted on multiple hosting providers using quick deployment with tools such as Docker-Machine, Helm-charts, Rancher, etc.
    2. federate all shards into one cohesive user experience
    3. policies as to which data centers are eligible to host the primary and mirror instances of a shard are set per repo/project
  12. User management
    1. Users need to be able to specify their own timezone preference, which may be different than their browser.
    2. Users need to be able to specify their own locale preferences, which may be different than their browser.
    3. User Identity needs to be a separate service from the main GitLab Rails app.
    4. User Authorization needs to be a separate service from the GitLab Rails app.
  13. Black box integration test
    1. A full end-to-end GitLab functionality test is needed at the API and HTTP system boundary (see the black-box smoke test sketch after this list). This does not need to include full code-coverage testing, which is already performed at the component level.
    2. Serves all GitLab deployments.
  14. Ingress is to be anycast with multiple points of presence around the globe.
    1. From the point of entry, each request would be redirected (or routed) to the back-end shard able to handle it.
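
Below are a few illustrative sketches for items referenced above. Everything in them (shard names, hostnames, schemas, endpoints) is an assumption for illustration, not a proposed implementation.

Item 4.4 suggests growing by splitting an existing shard by hash range on the namespace/repo path. A minimal sketch of what such routing and a range split could look like, assuming a stable hash over the repo path and a boundary table:

```python
import bisect
import hashlib

# Hypothetical sketch: stable hash-range routing of repositories to shards.
# Shard names and split points are illustrative assumptions only.

def repo_hash(repo_path: str) -> int:
    """Stable 32-bit hash of a namespace/repo path, e.g. 'gitlab-org/gitlab-ce'."""
    digest = hashlib.sha256(repo_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # 0 .. 2**32 - 1

class HashRangeRouter:
    """Maps a repo path to a shard by hash range; splitting a shard means
    inserting a new boundary inside its range and migrating the upper half."""

    def __init__(self, boundaries, shards):
        # boundaries[i] is the exclusive upper bound of shards[i]'s hash range.
        assert len(boundaries) == len(shards)
        self.boundaries = boundaries
        self.shards = shards

    def shard_for(self, repo_path: str) -> str:
        idx = bisect.bisect_right(self.boundaries, repo_hash(repo_path))
        return self.shards[min(idx, len(self.shards) - 1)]

    def split(self, shard_name: str, new_shard_name: str) -> None:
        """Halve one shard's hash range; repos hashing above the midpoint
        would then be migrated (per item 7) to the new shard."""
        idx = self.shards.index(shard_name)
        lower = self.boundaries[idx - 1] if idx > 0 else 0
        midpoint = (lower + self.boundaries[idx]) // 2
        self.boundaries.insert(idx, midpoint)
        self.shards.insert(idx, shard_name)    # existing shard keeps the lower half
        self.shards[idx + 1] = new_shard_name  # new shard takes the upper half

router = HashRangeRouter(boundaries=[2**31, 2**32], shards=["shard-a", "shard-b"])
print(router.shard_for("gitlab-org/gitlab-ce"))
router.split("shard-a", "shard-c")  # shard-a now covers half its former range
```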
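
Item 6.2 calls out the cost of the “projects this user can see” query. One possible strategy, sketched here with SQLite purely for illustration (not GitLab's actual schema), is a denormalized per-user authorization table so the lookup becomes an index scan instead of a scan over all projects:

```python
import sqlite3

# Hypothetical schema: one row per (user, project) the user may see,
# maintained on membership/visibility changes rather than computed per request.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE project_authorizations (
        user_id    INTEGER NOT NULL,
        project_id INTEGER NOT NULL REFERENCES projects(id),
        PRIMARY KEY (user_id, project_id)
    );
""")

conn.execute("INSERT INTO projects VALUES (1, 'gitlab-org/gitlab-ce')")
conn.execute("INSERT INTO project_authorizations VALUES (42, 1)")

def visible_projects(user_id):
    # Index scan on the (user_id, project_id) primary key; no scan
    # over every row of the projects table.
    rows = conn.execute("""
        SELECT p.id, p.path
        FROM project_authorizations a
        JOIN projects p ON p.id = a.project_id
        WHERE a.user_id = ?
    """, (user_id,))
    return rows.fetchall()

print(visible_projects(42))  # [(1, 'gitlab-org/gitlab-ce')]
```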
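
Items 9 and 14.1 describe sending each request to the shard that owns the repo, either by routing at the ingress or by redirecting the client. A sketch of the redirect variant for HTTPS sessions, with hypothetical shard hostnames and a placeholder lookup:

```python
import hashlib
from urllib.parse import urlsplit

# Hypothetical shard hostnames; in practice this mapping would come from the
# routing table maintained by the hash-range router sketched above.
SHARD_HOSTS = {
    "shard-a": "shard-a.gitlab.example.com",
    "shard-b": "shard-b.gitlab.example.com",
}

def shard_lookup(namespace_repo: str) -> str:
    """Placeholder lookup: stable hash of 'group/project' over two shards."""
    digest = hashlib.sha256(namespace_repo.encode("utf-8")).digest()
    return "shard-a" if digest[0] % 2 == 0 else "shard-b"

def ingress_response(request_url: str):
    """For an HTTPS request hitting the anycast ingress, answer with a
    307 redirect to the shard that owns the namespace/repo in the path."""
    path = urlsplit(request_url).path.strip("/")
    namespace_repo = "/".join(path.split("/")[:2])  # e.g. 'gitlab-org/gitlab-ce'
    host = SHARD_HOSTS[shard_lookup(namespace_repo)]
    return 307, {"Location": f"https://{host}/{path}"}

print(ingress_response("https://gitlab.example.com/gitlab-org/gitlab-ce/issues/1"))
```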
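
Item 13 asks for a black-box test at the API/HTTP system boundary. A minimal smoke-test sketch against the v4 REST API; BASE_URL and TOKEN are placeholders for the instance under test, and a real suite would add git-over-HTTPS clone/push checks:

```python
import json
import os
import urllib.request

# Placeholders: point these at the deployment under test.
BASE_URL = os.environ.get("GITLAB_URL", "https://gitlab.example.com")
TOKEN = os.environ.get("GITLAB_TOKEN", "")

def api_get(path: str):
    """GET a v4 API path at the HTTP boundary, authenticated with a private token."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/v4{path}",
        headers={"PRIVATE-TOKEN": TOKEN},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status, json.loads(resp.read())

def test_instance_responds():
    status, body = api_get("/version")
    assert status == 200 and "version" in body

def test_projects_listable():
    status, body = api_get("/projects?per_page=1")
    assert status == 200 and isinstance(body, list)

if __name__ == "__main__":
    test_instance_responds()
    test_projects_listable()
    print("black-box smoke test passed")
```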

Related Issues

Challenges:

  • When a single repository is larger than the desired shard size (e.g., the GitLab CE repo).
  • When a single repository must support more concurrent users than a single-node instance can handle; some horizontal/vertical scaling will then be necessary.

cc: @ernstvn @sytses @stanhu @mydigitalself @joshlambert
