META - RFC - Notes on Sharding, GitLab Scaling, Availability and related issues
RFC - Request for Comments
Invitation to link related issues for Sharding and GitLab architecture.
- Sharding
- DR
- HA
- Deduplication
- Multi-site
- Locking data hosting to specific geographic regions
- User-TZ
- User-Locale
- Containerization
- Orchestration
- Global Ingress - Anycast
- Blackbox integration test
Observations after deploying both small and global instances of GitLab:
GitLab EE and GitLab CE, and the team behind them, have a proven track record of providing highly available single-node deployments (better than six nines when excluding the effect of deployments/upgrades). On the other hand, availability has not proven to be better than two nines when the roles/functions within a GitLab instance are spread over a cluster of nodes.
- As a result, if a "shard" is defined to be a full GitLab stack from ingress to persistent storage, a cluster of such shards can be highly reliable thanks to the inherent bulkheads between shards.
- This is “boring” - simpler and easier to maintain than having a separate technology stack for sharding EACH of the persistent storage engines (Redis, PostgreSQL, POSIX (NFS)), each of which has its own scaling challenges.
- It also allows different shards to operate at different rev-levels of the code-base (enhancing the ability to run canaries, A/B feature testing, etc.)
- Sharding - DR/HA/failover should be done at the level of a shard, i.e., a full instance of GitLab. Shards should be small enough to allow rapid backup/restore/cloning.
- Shards should be mirrored remotely, with failover and parallel read capability.
- Corollary - there should be enough shards to host all of gitlab.com without any single shard exceeding the size limit above
- if gitlab.com is 200TB (heading to 1 PB based on sales goals), 1,000 shards would yield 200GB (soon to be 1TB) per shard.
- 1,000 simple instance shards would also have lower OPEX than the current gitlab.com HA clusters supporting NFS / PostgreSQL / Redis, and would be much easier to maintain.
- SRE - a large number of shards greatly enhances the ability to deliver a 5 9’s or better SLA to end users, as the risk of component failure is confined to a small subset of the entire user base via the inherent bulkheading between shards.
- As the system grows, the number of shards should grow accordingly; the easiest growth strategy is to split an existing shard by hash range on the namespace/repo path (a minimal sketch of this idea follows below)
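A minimal sketch of hash-range sharding, assuming repositories are mapped into a fixed hash keyspace and each shard owns a contiguous range of that keyspace; splitting a shard is then just splitting its range. All names here (`ShardMap`, `shard_for`, `split`) are hypothetical and not GitLab code.

```python
# Minimal sketch (not GitLab code): assign repositories to shards by hashing
# the namespace/repo path into a fixed keyspace; grow by splitting one
# shard's hash range in two.
import bisect
import hashlib

KEYSPACE = 2 ** 32  # size of the hash keyspace


def path_key(repo_path: str) -> int:
    """Map a namespace/repo path to a stable point in the keyspace."""
    digest = hashlib.sha256(repo_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % KEYSPACE


class ShardMap:
    """Ordered list of (range_end, shard_name); each shard owns the keys
    above the previous range_end up to and including its own range_end."""

    def __init__(self):
        self.ranges = [(KEYSPACE - 1, "shard-0")]  # one shard owns everything

    def shard_for(self, repo_path: str) -> str:
        key = path_key(repo_path)
        ends = [end for end, _ in self.ranges]
        return self.ranges[bisect.bisect_left(ends, key)][1]

    def split(self, shard_name: str, new_shard_name: str):
        """Split an existing shard's hash range in half; repositories whose
        keys fall in the lower half would then be migrated to the new shard."""
        for i, (end, name) in enumerate(self.ranges):
            if name == shard_name:
                start = self.ranges[i - 1][0] + 1 if i > 0 else 0
                mid = (start + end) // 2
                self.ranges[i:i + 1] = [(mid, new_shard_name), (end, shard_name)]
                return
        raise KeyError(shard_name)


shards = ShardMap()
print(shards.shard_for("gitlab-org/gitlab-ce"))   # shard-0
shards.split("shard-0", "shard-1")                # grow by splitting the range
print(shards.shard_for("gitlab-org/gitlab-ce"))   # now shard-0 or shard-1
```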
- Shards should be a “collection of repositories” so that complete user-visible functionality is preserved in the event of component failures within a single shard
- system availability is greatly improved
- the current GitLab.com system (Sept 2017) is a single instance for all users; the user impact of DR is roughly (time-to-restore * number of users affected), and time-to-restore is itself linear in the number of users/repositories, so the impact grows as O(N^2) in the number of users.
- technology stacks are BORING - less complexity than horizontal clusters
- since DR is handled by mirroring entire shard instances, there is no real need for Redis/Sentinel, Gitaly, or PgPool/PostgreSQL as separate nodes.
- no need for name lookup/DNS resolution to connect to most of the related services in the GitLab stack, since they can communicate over local Unix sockets.
- User management/identity/authentication is an orthogonal concern to sharding by groups of repositories (separation of concerns at the architecture level)
- short term, user identity/authorization data for all users can be synchronized onto all shards.
- one of the most costly queries (frequency * cost in CPU and I/O) is “create the list of projects a user is authorized to see”, which currently does table scans on the projects table. Use another strategy, such as a separate precomputed table, index scans, or a map/reduce cluster instead of the relational DB (a minimal sketch of the precomputed-lookup idea follows below).
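A minimal sketch of the precomputed-lookup idea, not GitLab's actual implementation: keep a per-user set of authorized project ids that is updated whenever membership changes, so the expensive “what can this user see” question becomes a direct lookup instead of a scan. The function and table names are hypothetical.

```python
# Minimal sketch (not GitLab code): precompute per-user authorized project ids
# on membership changes, so "list projects this user can see" is a lookup,
# not a table scan.
from collections import defaultdict

authorized_projects = defaultdict(set)   # user_id -> {project_id, ...}


def grant(user_id: int, project_id: int):
    """Called whenever a membership/share is added."""
    authorized_projects[user_id].add(project_id)


def revoke(user_id: int, project_id: int):
    """Called whenever a membership/share is removed."""
    authorized_projects[user_id].discard(project_id)


def visible_projects(user_id: int) -> set:
    """Direct lookup of the precomputed set; no scan of the projects table."""
    return authorized_projects[user_id]


grant(42, 1001)
grant(42, 1002)
revoke(42, 1001)
print(visible_projects(42))  # {1002}
```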
- Migration to be provided at the project/repo level, i.e., move a project/repo from ShardA to ShardB; this supports the initial distribution of data from the single Azure cluster of GitLab.com
- Region Locking - a common customer requirement is restricting the physical location of their data to a particular geographic/political region for privacy, compliance, and latency purposes.
- Each repo will be assigned to a machine instance in the appropriate location.
- Connecting a user to the correct shard can be handled either by routing from the ingress nodes or by redirecting the user for HTTPS sessions (a minimal routing sketch follows below).
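A minimal sketch of the redirect option, assuming an ingress node that knows which shard hosts each top-level namespace and sends the client a 302 to that shard's hostname. The shard hostnames and the static lookup table are hypothetical placeholders; in practice the mapping would come from a routing service.

```python
# Minimal sketch (not GitLab code): an ingress node that redirects the HTTPS
# client to the shard hosting the requested namespace/repo.
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping; a real deployment would query a routing service.
SHARD_FOR_NAMESPACE = {
    "gitlab-org": "shard-eu-1.example.com",
    "acme-corp":  "shard-us-2.example.com",
}


class IngressHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        namespace = self.path.lstrip("/").split("/", 1)[0]
        shard_host = SHARD_FOR_NAMESPACE.get(namespace)
        if shard_host is None:
            self.send_error(404, "unknown namespace")
            return
        # Redirect the client to the shard that owns this namespace.
        self.send_response(302)
        self.send_header("Location", f"https://{shard_host}{self.path}")
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), IngressHandler).serve_forever()
```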
- Deduplication
- git packs are not conducive to global deduplication. Git is already a content-addressed object store with individually hashed object paths; combining those objects into bundles (packs) hides the information that would allow the same object in forked/cloned repositories to share the same storage bytes/instance as the original upstream repo (see the sketch below).
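A short illustration of why loose objects are dedup-friendly: Git's object id for a blob is a hash of its content (plus a small header), so identical content in an upstream repo and a fork produces the same object id and could in principle share storage, whereas packed objects hide that identity at the filesystem level.

```python
# Git assigns a blob the SHA-1 of "blob <size>\0<content>"; identical content
# yields identical object ids across repositories.
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object id Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


upstream = git_blob_id(b"puts 'hello'\n")
fork     = git_blob_id(b"puts 'hello'\n")
print(upstream == fork)  # True: same content, same object id in both repos
```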
- Multi-site
- shards, as full GitLab instances, can be hosted on multiple hosting providers using quick deployment with tools such as Docker Machine, Helm charts, Rancher, etc.
- federate all shards into one cohesive user experience
- per-repo/project policies determine which data centers are eligible to host the primary and replica instances of the shard (a minimal policy sketch follows below)
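A minimal sketch of what such a per-project placement policy could look like; the region names, policy shape, and `may_host` check are all hypothetical, intended only to show how region locking and multi-site placement can share one policy object.

```python
# Minimal sketch (not GitLab code): per-project placement policy naming the
# regions eligible to host the primary and replicas, with a check used before
# assigning a shard copy to a data center.
PLACEMENT_POLICIES = {
    "acme-corp/payments": {
        "allowed_regions": {"eu-west", "eu-central"},  # region-locked to the EU
        "primary": "eu-west",
        "replicas": ["eu-central"],
    },
    "gitlab-org/gitlab-ce": {
        "allowed_regions": {"us-east", "eu-west", "ap-southeast"},
        "primary": "us-east",
        "replicas": ["eu-west", "ap-southeast"],
    },
}


def may_host(project_path: str, region: str) -> bool:
    """True if the given region is eligible to host any copy of the project."""
    policy = PLACEMENT_POLICIES.get(project_path)
    return policy is not None and region in policy["allowed_regions"]


print(may_host("acme-corp/payments", "us-east"))  # False: locked to the EU
```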
- User management
- Users need to be able to specify their own timezone preference, which may differ from their browser’s setting.
- Users need to be able to specify their own locale preference, which may differ from their browser’s setting.
- User Identity needs to be a separate service from the main GitLab Rails app.
- User Authorization needs to be a separate service from the GitLab Rails app.
- Black box integration test
- a full end-to-end GitLab functionality test is needed at the API and HTTP system boundaries. This does not need to include full code-coverage testing, which is already performed at the component level (a minimal smoke-test sketch follows below).
- Serves all GitLab deployments.
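A minimal sketch of a black-box check that exercises a deployment only through its public HTTP/API boundary. `GITLAB_URL` and `GITLAB_TOKEN` are assumed environment variables; `/api/v4/version` and `/api/v4/projects` are existing GitLab API endpoints, but this is a smoke test, not GitLab's QA suite.

```python
# Minimal black-box smoke test: talk to a GitLab deployment only via its
# public API, as any external client would.
import json
import os
import urllib.request

BASE = os.environ.get("GITLAB_URL", "https://gitlab.example.com")
TOKEN = os.environ.get("GITLAB_TOKEN", "")


def api_get(path: str):
    req = urllib.request.Request(
        f"{BASE}/api/v4{path}",
        headers={"PRIVATE-TOKEN": TOKEN},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        assert resp.status == 200, f"{path} returned {resp.status}"
        return json.loads(resp.read())


if __name__ == "__main__":
    print("version:", api_get("/version")["version"])        # instance is up
    print("projects visible:", len(api_get("/projects?per_page=5")))
```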
- Ingress to be anycast with multiple points of presence around the globe.
- from the point of entry, each request would be redirected (or routed) to the back-end shard able to handle the request.
Challenges:
- When a single repository is larger than the desired shard size (e.g., the GitLab CE repo).
- When a single repository must support a larger number of concurrent users than a single-node instance can handle; some horizontal/vertical scaling will still be necessary.
cc: @ernstvn @sytses @stanhu @mydigitalself @joshlambert