Database Team - 16.9 Planning
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Capacity
This milestone, the Database group is operating at reduced capacity due to end-of-year PTO. For a list of upcoming absences, please refer to our weekly status update. Please also keep in mind that about half of the team's capacity is typically consumed by unplanned work.
Boards
Planning
We are maintaining focus on the initiatives that most affect the availability and reliability of GitLab.com and self-managed instances:
- Our high-level focus is a multi-pronged database scaling strategy summarized here: https://gitlab.com/gitlab-org/gitlab/-/issues/397121+
- We are additionally concerned about the potential risk of lightweight lock contention on our primary DB. This may impact DB availability during times of high traffic and may be exacerbated by our partitioning efforts. We are looking to strike a balance between faster access to data (partitioning) and high availability (mitigating lock contention). We believe our query testing efforts will help us walk this fine line.
We have one FTE engineer assigned as a stable counterpart to support datastore solutions for AI-related initiatives.
16.9
Top Priorities for Reducing Lightweight Lock Contention
Based on our estimates in https://gitlab.com/groups/gitlab-org/-/epics/11639+s, we will brainstorm and prioritize opportunities to further improve database efficiency. Our target is to ensure database scalability until Tenant Cells is ready.
We are currently focused on reducing lightweight lock contention as our biggest limiting factor. We have put a number of mitigations in place, including the removal of under-utilized indices. This workstream is closely coupled with our table size reduction efforts (described below).
DRI: @krasio
Update: We are actively working to improve the robustness of our mitigations by implementing vertical table splits and plan caching. This will also help with WAL rate reduction (another scaling limitation called out below).
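As a rough illustration of how under-utilized indices can be spotted (not the exact criteria we apply before dropping anything), here is a sketch that reads Postgres index statistics from a Rails console. The thresholds and output format are invented for the example.

```ruby
# Illustrative only: list indexes that have rarely been used since the last
# statistics reset, ordered by on-disk size. The idx_scan threshold below is
# made up; it is not the bar we use in production.
rows = ApplicationRecord.connection.select_all(<<~SQL)
  SELECT relname AS table_name,
         indexrelname AS index_name,
         idx_scan AS scans,
         pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
  WHERE idx_scan < 100
  ORDER BY pg_relation_size(indexrelid) DESC
  LIMIT 20
SQL

rows.each { |r| puts "#{r['table_name']}.#{r['index_name']}: #{r['scans']} scans, #{r['index_size']}" }
```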
Table size reduction effort
Verify DRI: @mattkasa
Create DRI: @dfrazao-gitlab
While reducing lightweight lock contention is the team's primary concern (and therefore its primary focus), we're still working with the Create and Verify groups to partition some of the largest tables in our database. @stomlinson is stepping back from these topics to focus on the WAL rate investigation.
Related items:
- CI Partitioning Support
  - Update: The team has started writing to the newly partitioned `ci_builds` and `ci_builds_metadata` tables!
  - The use of a new partition significantly reduced the load on the `ci_builds` table and the work the vacuum process needed to do, allowing background migrations to complete in mere weeks instead of months. (A minimal sketch of this kind of list partitioning follows this list.)
- Partition `merge_request_diff_*`
  - Update: Our part of this is mostly done; the helpers are created and Diogo is working with the Create team to use them.
  - We are available to support the merge_request_diff team as they implement the partitioning itself.
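For context, here is a minimal sketch of the kind of Postgres list partitioning involved. The migration class, table, and partition values are invented for illustration; the actual `ci_builds` work is more involved and goes through our partitioning helpers rather than raw DDL.

```ruby
# Illustrative only: a list-partitioned table keyed by partition_id, created
# with raw DDL for clarity. Names and values are made up for the example.
class CreatePartitionedExampleBuilds < Gitlab::Database::Migration[2.2]
  def up
    # Partitioned tables must include the partition key in the primary key.
    execute(<<~SQL)
      CREATE TABLE p_example_builds (
        id bigint NOT NULL,
        partition_id bigint NOT NULL,
        status text,
        PRIMARY KEY (id, partition_id)
      ) PARTITION BY LIST (partition_id)
    SQL

    # New writes can be directed to a fresh partition, which keeps the hot
    # partition small and reduces the vacuum workload per table.
    execute(<<~SQL)
      CREATE TABLE example_builds_100
        PARTITION OF p_example_builds FOR VALUES IN (100)
    SQL
  end

  def down
    execute('DROP TABLE p_example_builds')
  end
end
```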
WAL Rate Reduction
What: WAL rate (the rate at which write-ahead log data is generated by the primary Postgres database) is an urgent contention point for GitLab.com. @msmiley
has an excellent write-up in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2597.
Why: Without reducing the WAL generation rate (or possibly increasing our capacity to apply WAL to replica databases), replicas will fall behind in time and stop serving requests, leading to a site outage. It's critical that we avoid this scenario.
In %16.9, we are continuing our efforts to support feature flags in the load balancer; we hope to add a feature flag to ignore replication lag to keep the site running if the alternative is an outage.
We'll also continue our investigation into ways to reduce WAL volume by analyzing WAL files from production, and simulating how they would change if we had different checkpointing frequencies.
DRI: @stomlinson
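As a hedged illustration of what "WAL rate" refers to (our real monitoring uses dedicated exporters rather than a console script), the primary's WAL position can be sampled over an interval to estimate bytes generated per second:

```ruby
# Illustrative only: estimate WAL bytes generated per second on the primary
# by sampling pg_current_wal_lsn() twice. The interval and output are made up.
def current_wal_lsn
  ApplicationRecord.connection.select_value('SELECT pg_current_wal_lsn()')
end

start_lsn = current_wal_lsn
sleep 10
end_lsn = current_wal_lsn

bytes = ApplicationRecord.connection.select_value(
  "SELECT pg_wal_lsn_diff('#{end_lsn}', '#{start_lsn}')"
).to_f

puts "Approximate WAL rate: #{(bytes / 10 / 1024 / 1024).round(2)} MiB/s"
```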
Secondary Focus Items
Migrations should run in milestone, then type, ... (gitlab-org&10411)
DRI: @jon_jenkins
What: We aim to enrich database migrations by tagging them with the milestone they belong to. This will let us improve how we order migrations when executing them, rather than relying on the migration's timestamp (version), which is somewhat arbitrary and can be misleading.
Why: This will improve the upgrade experience for self-managed customers that jump multiple milestones at a time: if they hit an error when executing migrations, we will be able to tell which GitLab release their schema is compatible with.
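For illustration, the developer-facing tagging syntax looks roughly like the following. The class name, table, column, base class version, and milestone value are all invented for this example; the linked merge request below has the actual implementation.

```ruby
# Illustrative migration showing the milestone tagging idea described above.
# The details (base class version, DSL, table, column) are assumptions made
# for this sketch, not a copy of the real code.
class AddExampleColumnToProjects < Gitlab::Database::Migration[2.2]
  milestone '16.9'

  def change
    add_column :projects, :example_setting, :boolean, default: false, null: false
  end
end
```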
Update
Work is continuing: gitlab-org/gitlab!137190 (merged). For posterity, we did this spike as a proof of concept.
What's complete:
- Milestone tagging for migration classes using developer-friendly syntax
- Spec testing on custom version and milestone objects
- We enforce this through migration class inheritance
- Code to read current milestone
What is pending release:
- Actual migration ordering code
- Rolling back
- `db:migrate:status` displays migrations in the correct order, along with their milestone and migration type
What needs to be done:
- Final debugging and testing before release
Load Balancer Improvements
DRI: @mattkasa
What: We're making some changes to the load balancer to support more efficient and zone-aware traffic routing. To accomplish this, we're also making necessary improvements to the underlying classes.
Why: As a critical part of our database layer, the load balancer must be stable and reliable. Adding zone-aware database routing will reduce network request time and cost by picking replicas in the same zone first.
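As a hedged sketch of the routing idea only (the class and method names below are hypothetical, not our actual load balancer code), zone-aware selection amounts to preferring healthy replicas in the caller's zone and falling back to any healthy replica:

```ruby
# Hypothetical sketch of zone-aware replica selection. Replica, #zone, and
# #healthy? are invented for this example; they stand in for whatever the
# load balancer actually tracks about each replica host.
class ZoneAwareReplicaPool
  def initialize(replicas, local_zone:)
    @replicas = replicas
    @local_zone = local_zone
  end

  # Prefer a healthy replica in the same zone to avoid cross-zone traffic;
  # if none is available, fall back to any healthy replica.
  def pick
    healthy = @replicas.select(&:healthy?)
    local = healthy.select { |replica| replica.zone == @local_zone }

    candidates = local.empty? ? healthy : local
    candidates.sample
  end
end
```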