# SQL Traffic Replay Tooling for GitLab.com
## Short description
This tool would let us measure our database capacity empirically. It could effectively settle questions about the limits of our current setup, as well as the effectiveness of other mitigations.
Develop comprehensive tooling to capture, store, and replay SQL query traffic from GitLab.com. The solution will implement a lightweight query-forwarding mechanism within GitLab that sends SQL queries to an external service with minimal performance impact on Rails and Sidekiq processes. Combined with purpose-built replay utilities, this system will enable performance testing, capacity planning, and database architecture evaluation. It will allow us to simulate production loads at variable speeds, identify saturation and contention points, assess potential database configuration changes, and validate sharding strategies, all without adversely affecting production systems.
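The "minimal performance impact" constraint above implies that capture must never block the request path: if the forwarding pipeline falls behind, events are dropped rather than queued synchronously. A minimal Ruby sketch of that idea (class and method names are hypothetical, not the actual implementation):

```ruby
# Sketch of a low-overhead capture buffer. Queries are pushed
# non-blocking; when the buffer is full the event is dropped rather
# than slowing down the Rails/Sidekiq process doing the capture.
class QueryCaptureBuffer
  def initialize(max_size)
    @queue = SizedQueue.new(max_size)
    @dropped = 0
  end

  attr_reader :dropped

  # Returns true if the query was captured, false if it was dropped.
  def capture(sql, started_at)
    @queue.push({ sql: sql, ts: started_at.to_f }, true) # non-blocking push
    true
  rescue ThreadError # queue full: drop instead of blocking the caller
    @dropped += 1
    false
  end

  # Drained by a background flusher that ships events to the external
  # service; the flusher itself is out of scope for this sketch.
  def drain
    events = []
    events << @queue.pop(true) until @queue.empty?
    events
  end
end
```

The drop-instead-of-block choice trades capture completeness for production safety, which matches the goal of negligible impact on serving traffic.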
## Connection to technical roadmap
https://gitlab.com/gitlab-com/gl-infra/infra-roadmap/-/merge_requests/199+
As GitLab onboards new customers and more people use GitLab.com concurrently, it's vital that we maintain the availability of our databases. Traffic replay will be a critical benchmarking component in service of this goal.
Traffic replay serves two related purposes:
1. Traffic replay will uncover upcoming failure modes for our databases as traffic scales, including failure modes that we are not yet aware of. By compressing captured traffic into a shorter timespan, we can simulate increased load at high fidelity and uncover these problems before they occur in production.
2. Traffic replay will be a high-quality benchmarking tool that we can use to explore changes to database infrastructure, including changes that are otherwise high risk but could buy us a lot of capacity. Vertical splits, sharding, and database version upgrades are all risky to deploy to production, yet each has the potential to drastically increase our capacity. Traffic replay will allow us to safely measure and compare any such changes to database architecture, and deploy them to production with confidence.
Both of these are critical in maintaining our availability going forward in the face of increased scaling demands.
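The time-compression idea in point 1 amounts to timestamp scaling: to replay at a speed factor `S`, each captured query is scheduled at its original offset from the start of the capture divided by `S`. A minimal illustration (pure Ruby; the event shape and function name are assumptions for this sketch, not the real replay tool):

```ruby
# Compress captured inter-query gaps by a speed factor, so that
# e.g. speed: 2.0 replays two hours of captured traffic in one hour.
# Each event carries the original capture timestamp in :ts (seconds).
def replay_schedule(events, speed:)
  t0 = events.first[:ts]
  events.map do |e|
    { sql: e[:sql], at: (e[:ts] - t0) / speed }
  end
end

captured = [
  { sql: "SELECT 1", ts: 100.0 },
  { sql: "SELECT 2", ts: 103.0 },
  { sql: "SELECT 3", ts: 106.0 },
]
replay_schedule(captured, speed: 1.5)
# replay offsets become 0.0, 2.0, 4.0 seconds instead of 0, 3, 6
```

A `speed:` of 1.0 reproduces the original pacing, matching the 1x and 1.5x replay runs named in the exit criteria below.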
<!--
Link to https://infra-roadmap-c6d14f.gitlab.io/ items, such as https://infra-roadmap-c6d14f.gitlab.io/#project-durability_unified_backup_restore
In the technical roadmap [YAML](https://gitlab.com/gitlab-com/gl-infra/infra-roadmap/-/blob/main/data/stage-data_access.yml), explain how the project aligns with stated goals of the company. Which goal(s) is it helping along?
- [FY26-28 Platforms strategy](https://docs.google.com/document/d/1E5T9TSkqxWkvCpWNbfrqmEFM5sXMO4HT-D22m_QjfyA/edit?tab=t.0)
- [FY26 Data Access Product Outcomes](https://docs.google.com/document/d/1ymSJU24RkSC7n4YzuwYgQj6F8Yx_rNsSBk-2hIgkxI4/edit?tab=t.0#heading=h.lun6ty6cqwk6)
- [FY26 plan from Bill](https://university.gitlab.com/learn/course/draft-fy26-company-memo/main/fy26-company-memo?client=internal-team-members&page=3)
- [Data Access Vision](https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/data-access/#vision)
Don't forget to link back this epic from the yml.
-->
## Expected impact
<!-- What will happen? How will it be measured? -->
* Negligible performance impact on production Rails/Sidekiq processes
* Increased confidence in database scaling decisions and configuration changes
* Enhanced ability to identify and mitigate performance bottlenecks proactively
* Improved database architecture testing capabilities without production risk
## Exit criteria
<!-- When is it done? Prevent scope creep by defining it here! -->
* Successfully capture 2+ hours of production query traffic
* Replay captured traffic against a test environment at 1x and 1.5x speeds
* Validate replay results against expected performance metrics
* Complete documentation of capture and replay processes
* Complete automation of infrastructure creation, traffic replay, and infrastructure deletion
* Automation of capture data deletion according to data retention policies
## Timeline and Effort
<!-- How much time do we think it'll take to complete? Wild guesses are appropriate here! We can always iterate on any section here. -->
~12-16 weeks with 2-3 engineers from ~"group::database frameworks" and 1-2 SREs from ~"group::database operations".
## Deliverables
- %"18.0"
- [x] Define SQL Traffic Capture and Replay Architecture (https://gitlab.com/gitlab-org/gitlab/-/issues/538895)
- [x] Document security implementation details
- [x] Security review
- [x] Data engineering review
- %"18.1"
- [x] Provision cloud pubsub and dataflow
- [x] Provision pgbouncer and patroni chef roles
- [x] Infrastructure cost estimate (https://gitlab.com/gitlab-org/gitlab/-/issues/541537)
- [x] Upgrade grpc gem (https://gitlab.com/gitlab-org/gitlab/-/issues/547514)
- %"18.2"
- [x] Start POC of SQL traffic capture (https://gitlab.com/gitlab-org/gitlab/-/issues/548606)
- [x] Start POC of SQL traffic replay (https://gitlab.com/gitlab-org/gitlab/-/issues/548607)
- %"18.3"
- [x] Continue POC of SQL traffic capture (https://gitlab.com/gitlab-org/gitlab/-/issues/548606)
- [x] Continue POC of SQL traffic replay (https://gitlab.com/gitlab-org/gitlab/-/issues/548607)
- %"18.4"
- [x] Configure bucket access for Rails (https://gitlab.com/gitlab-org/gitlab/-/merge_requests/202140)
- [x] Configure charts for bucket access (https://gitlab.com/gitlab-org/charts/gitlab/-/merge_requests/4521)
- [x] Create staging bucket infrastructure (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20477)
- %"18.5"
- [x] Create production bucket infrastructure (https://gitlab.com/gitlab-org/gitlab/-/issues/562377)
- %"18.6"
- [ ] Merge capture implementation (https://gitlab.com/gitlab-org/gitlab/-/merge_requests/197240)
- [ ] Rollout SQL Traffic Replay Feature Flag (https://gitlab.com/gitlab-org/gitlab/-/issues/573592)
- [ ] Get SQL traffic replay to read from bucket (https://gitlab.com/gitlab-org/gitlab/-/issues/573593)
- %"18.7"
- [ ] Add config to pods and test bucket config in staging (https://gitlab.com/gitlab-org/gitlab/-/issues/573594)
- [ ] Write SQL traffic capture to the bucket (https://gitlab.com/gitlab-org/gitlab/-/issues/573596)
- [ ] Adapt replayer format to match capture format (https://gitlab.com/gitlab-org/gitlab/-/issues/573597)
- %"18.8"
- [ ] Build a self-service way to restore a staging backup to a point in time - requires DBO or SRE help (https://gitlab.com/gitlab-org/gitlab/-/issues/573598)
- [ ] Restore a staging database backup (https://gitlab.com/gitlab-org/gitlab/-/issues/573599)
- Future
- [ ] Document first draft of SQL Traffic Replay data format (https://gitlab.com/gitlab-org/gitlab/-/issues/548605)
- [ ] End-to-end capture and replay validation
<!-- DO NOT EDIT BELOW - Used for the epic status automation bot -->
<!-- STATUS NOTE START -->
## Status 2025-12-02
:tada: **achievements**:
- `@mattkasa` has MWPS set on https://gitlab.com/gitlab-org/gitlab/-/merge_requests/197240+
- `@stomlinson` merged https://gitlab.com/gitlab-org/database-team/traffic-replay-poc/-/merge_requests/2+
- `@l.rosa` merged https://gitlab.com/gitlab-org/charts/gitlab/-/merge_requests/4636+
:issue-blocked: **blockers**:
- None, but `@stomlinson` will be working on a different project this week.
:arrow_forward: **next**:
- `@l.rosa` will work on merging https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/4996+
- `@mattkasa` will work on rolling out the feature flag for capture
_Copied from https://gitlab.com/groups/gitlab-org/-/epics/17719#note_2925521693_
<!-- STATUS NOTE END -->