FY26-Q4 Tenant Scale Weekly Updates
Latest Update (2025-11-07)
Help Needed / Challenges
- group::gitaly: Rivian continues to push Gitaly beyond its limits, and the Gitaly team is getting requests for deeper involvement in debugging their performance issues. Dedicating more engineering resources could help surface improvement opportunities for more ultra-scale customers, but it would mean adjusting the timeline for RAFT since the team is at capacity.
📕 To Be Closed
- We have completed the Q3 work for Enable Sidekiq circuit-breaking on production
- We have completed the Q3 work for Backup and Restore testing environment improvements
- We have completed the Q3 work for Create new catchall redis shard
🌟 Highlights
- Put together a presentation about the team and what we will be doing in Q4.
- Met with @amyphillips to engage DevEx's help with the challenge identified last week for Valkey support.
- We have implemented a working proof of concept for pluggable object databases. This proof of concept uses MongoDB to store its objects and many of its parts already work as expected: you can use this repository for local workflows already, but there are still limitations. Using MongoDB is not meant as an endorsement of the technology, but was rather chosen for ease of implementation.
- Transactions benchmarking made another key discovery: response time shows a clear inflection point around 40 RPS. Interestingly, in this scenario transactions account for only 10% of the latency. This surfaces an opportunity to fix a previously unknown scale bottleneck.
- RAFT Routing work has begun with the first MR merged that ensures the routing table is cleaned up after any failures.
- Org Mover: Selective Checksum/Sync By Organization epic: We have 6 of 14 data types supported, with 4 more in progress or in review. We're also working on a timeboxed PoC of Geo Protocells Mode, which would set up the Protocell as a Geo secondary site so that we can use Geo mostly as-is for migrating data to a cell.
Organization Path Claiming - Happy Path Complete
- Merged and validated the happy path for claiming an organization path through the Topology Service. This ensures that organization paths are unique across cells.
- Demo video showing end-to-end flow
- Verification of the flow documented in gitlab-com/gl-infra/tenant-scale/cells-infrastructure/team#506 (comment 2877436314)
- We shipped rate limiting for the GET project members API endpoint (60 requests/minute per user). This endpoint has previously caused significant performance issues on gitlab.com due to the expensive queries involved – most users won't see an issue, but this should prevent badly-scripted calls from causing S1 incidents.
- We added the active argument to the user-contributed project resolver, which improves performance by hiding inactive projects.
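To illustrate the per-user limit on the project members endpoint mentioned above (60 requests/minute per user), here is a minimal fixed-window sketch. It is not GitLab's implementation: the class name, the in-memory store, and the fixed-window strategy are all assumptions made for illustration.

```python
import time


class FixedWindowRateLimiter:
    """Hypothetical per-user limiter: at most `limit` requests per window."""

    def __init__(self, limit=60, window_seconds=60, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._windows = {}  # user_id -> [window_start, request_count]

    def allow(self, user_id):
        """Return True if the request is within the user's budget."""
        now = self.clock()
        window = self._windows.get(user_id)
        # Start a fresh window on first use or once the old one has expired.
        if window is None or now - window[0] >= self.window_seconds:
            self._windows[user_id] = [now, 1]
            return True
        if window[1] < self.limit:
            window[1] += 1
            return True
        return False  # over budget; a real API would typically answer HTTP 429


if __name__ == "__main__":
    limiter = FixedWindowRateLimiter()
    allowed = sum(limiter.allow("user-1") for _ in range(100))
    print(f"{allowed} of 100 rapid requests allowed")
```

Because the budget is tracked per user, a badly-scripted client exhausts only its own allowance and other users are unaffected.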
Previous Updates
2025-11-07
Help Needed / Challenges
- FYI - We are still aligning on a proposal for how to proceed after rolling back the feature flag to disable forced deletions. Progress has been slower with the Dublin product offsite taking place this week.
📕 To Be Closed
- Native tool in Git to gather repository metrics... (gitlab-org&18040 - closed). We have upstreamed git repo structure into Git, which will evolve into a native replacement for git-sizer(1). The new tool will be used to feed dashboards built into GitLab that surface information around a repository's structure and health. This allows support and customers to more readily debug slow repositories and should thus help reduce the support load for Git and Gitaly.
🌟 Highlights
- Added the API endpoint to transfer a group into an Organization. This allows automated migration into Organizations and represents a significant milestone as we proceed towards transferring real customer groups into orgs.
- When an Organization is created via transfer, we now correctly assign todo items to the relevant internal bots (instead of previously-global ones).
- Merged the core logic for Claiming Organizations in the Rails monolith behind a feature flag. This allows us to ensure an Organization path is unique across cells. This work also introduces abstractions that allow us to more easily make other attributes claimable (email addresses for example).
- Fixed the Topology Service Security mirror, enabling security patches to be deployed without exposing vulnerabilities.
- Work on improving the primary verification experience is progressing well. The last API endpoint was added, making it possible to use the API to recalculate the primary checksum for failed models only. This widely requested feature will enable SREs and customers to proactively and quickly resolve primary data corruption, especially before Dedicated migrations.
- Protocells Org Mover selective sync: 1 more data type was merged (DependencyProxy::Blob). 6 of 14 data types have been completed, with 4 more in review/in dev.
- We have landed a change in Gitaly that starts to use git-last-modified(1). This tool has been upstreamed by us to address an N+1 problem that we face in Gitaly in ListLastCommitsForTree(). This RPC is executed whenever one navigates to the "Files" overview of a repository and spawns a separate process for each of the files. Early benchmarks show 3x improvements, but we expect even better results in production.
- Benchmarking continues to provide actionable insights. In testing, OverlayFS scaled much better than the deepclone method of taking repository snapshots (enabling the WAL for RAFT). In the below graphic, deepclone snapshot latency grows quadratically once repositories reach about 4k files, while OverlayFS shows much flatter latency growth. While more investigation is required, this data gives us confidence when considering different performance optimization strategies.
- Worked through the majority of items to bootstrap the team. group::tenant services is fully online!
- Finalized the list of Q4 projects and groomed Topology service readiness work, splitting it into multiple epics with DRIs assigned.
- With the team change, we're de-prioritizing Sidekiq/Redis work to focus more on supporting Cells. We are dropping Q4 work including: Valkey unit tests (see https://gitlab.com/groups/gitlab-com/gl-infra/data-access/durability/-/epics/36#note_2869103679), moving KAS to its own Redis instance, and non-critical Redis/Sidekiq interrupt work.
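The N+1 pattern described for ListLastCommitsForTree() above can be sketched with a toy in-memory history. This is only an illustration of the access pattern, not how Gitaly or git-last-modified(1) work internally: the function names and the data model (a newest-first list of commit IDs with their changed paths) are hypothetical.

```python
def last_modified_per_path(history, paths):
    """N+1 shape: one separate history walk for every path."""
    walks = 0
    result = {}
    for path in paths:
        walks += 1  # stands in for spawning one process per file
        for commit_id, changed in history:
            if path in changed:
                result[path] = commit_id
                break
    return result, walks


def last_modified_batched(history, paths):
    """Batched shape: a single walk that resolves every path."""
    remaining = set(paths)
    result = {}
    for commit_id, changed in history:
        # The first (newest) commit touching a path is its last modification.
        for path in remaining & changed:
            result[path] = commit_id
        remaining -= changed
        if not remaining:
            break
    return result, 1


if __name__ == "__main__":
    # Newest commit first; each entry lists the paths that commit changed.
    history = [
        ("c3", {"README.md"}),
        ("c2", {"src/main.go", "README.md"}),
        ("c1", {"LICENSE", "src/main.go"}),
    ]
    paths = ["README.md", "src/main.go", "LICENSE"]
    print(last_modified_per_path(history, paths))
    print(last_modified_batched(history, paths))
```

Both functions return the same mapping, but the number of walks in the first grows with the number of files in the tree, while the second stays at one.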
Edited by Nick Nguyen
