Database Team - 16.9 Planning
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Capacity
This milestone, the Database group is operating at reduced capacity due to end-of-year PTO. For a list of upcoming absences, please refer to our weekly status update. Please also keep in mind that about half of the team's capacity is typically consumed by unplanned work.
Boards
Planning
We are maintaining focus on the initiatives that most affect the availability and reliability of GitLab.com and self-managed instances:
- Our high-level focus is a multi-pronged database scaling strategy summarized here: https://gitlab.com/gitlab-org/gitlab/-/issues/397121+
- We are additionally concerned about the potential risk of lightweight lock contention on our primary DB. This may impact DB availability during times of high traffic and may be exacerbated by our partitioning efforts. We are looking to strike a balance between faster access to data (partitioning) and high availability (mitigating lock contention). We believe our query testing efforts will help us walk this fine line.
We have one FTE engineer assigned as a stable counterpart to support datastore solutions for AI-related initiatives.
16.9
Top Priorities for Reducing Lightweight Lock Contention
Based on our estimates in https://gitlab.com/groups/gitlab-org/-/epics/11639+s, we will brainstorm and prioritize opportunities to further improve database efficiency. Our target is to ensure database scalability until Tenant Cells is ready.
We are currently focused on reducing lightweight lock contention as our biggest limiting factor. We have put a number of mitigations in place, including the removal of under-utilized indices. This workstream is closely coupled with our table size reduction efforts (described below).
DRI: @krasio
Update: We are actively working to improve the robustness of our mitigations by implementing vertical table splits and plan caching. This will also help with WAL rate reduction (another scaling limitation called out below).
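As a rough illustration of how under-utilized indices can be spotted (not the exact criteria we apply before dropping anything), here is a sketch that reads Postgres index statistics from a Rails console. The thresholds and output format are invented for the example.

```ruby
# Illustrative only: list indexes that have rarely been used since the last
# statistics reset, ordered by on-disk size. The idx_scan threshold below is
# made up; it is not the bar we use in production.
rows = ApplicationRecord.connection.select_all(<<~SQL)
  SELECT relname AS table_name,
         indexrelname AS index_name,
         idx_scan AS scans,
         pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
  WHERE idx_scan < 100
  ORDER BY pg_relation_size(indexrelid) DESC
  LIMIT 20
SQL

rows.each { |r| puts "#{r['table_name']}.#{r['index_name']}: #{r['scans']} scans, #{r['index_size']}" }
```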
Table size reduction effort
Verify DRI: @mattkasa
Create DRI: @dfrazao-gitlab
While reducing lightweight lock contention is the team's primary concern (and therefore its primary focus), we're still working with the Create and Verify groups to partition some of the largest tables in our database. @stomlinson is stepping back from these topics to focus on the WAL rate investigation.
Related items:
- CI Partitioning Support
  - Update: The team has started writing to the newly partitioned `ci_builds` and `ci_builds_metadata` tables!
  - The use of a new partition significantly reduced the load on the `ci_builds` table and the work the vacuum process needed to do, allowing background migrations to complete in mere weeks instead of months. (A minimal sketch of this kind of list partitioning follows this list.)
- Partition `merge_request_diff_*`
  - Update: Our part of this is mostly done; the helpers are created and Diogo is working with the Create team to use them.
  - We are available to support the merge_request_diff team as they implement the partitioning itself.
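For context, here is a minimal sketch of the kind of Postgres list partitioning involved. The migration class, table, and partition values are invented for illustration; the actual `ci_builds` work is more involved and goes through our partitioning helpers rather than raw DDL.

```ruby
# Illustrative only: a list-partitioned table keyed by partition_id, created
# with raw DDL for clarity. Names and values are made up for the example.
class CreatePartitionedExampleBuilds < Gitlab::Database::Migration[2.2]
  def up
    # Partitioned tables must include the partition key in the primary key.
    execute(<<~SQL)
      CREATE TABLE p_example_builds (
        id bigint NOT NULL,
        partition_id bigint NOT NULL,
        status text,
        PRIMARY KEY (id, partition_id)
      ) PARTITION BY LIST (partition_id)
    SQL

    # New writes can be directed to a fresh partition, which keeps the hot
    # partition small and reduces the vacuum workload per table.
    execute(<<~SQL)
      CREATE TABLE example_builds_100
        PARTITION OF p_example_builds FOR VALUES IN (100)
    SQL
  end

  def down
    execute('DROP TABLE p_example_builds')
  end
end
```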
WAL Rate Reduction
What: WAL rate (the rate at which write-ahead log data is generated by the primary Postgres database) is an urgent contention point for GitLab.com. @msmiley
has an excellent write-up in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2597.
Why: Without reducing the WAL generation rate (or possibly increasing our capacity to apply WAL to replica databases), replicas will fall behind in time and stop serving requests, leading to a site outage. It's critical that we avoid this scenario.
In %16.9, we are continuing our efforts to support feature flags in the load balancer; we hope to add a feature flag to ignore replication lag to keep the site running if the alternative is an outage.
We'll also continue our investigation into ways to reduce WAL volume by analyzing WAL files from production, and simulating how they would change if we had different checkpointing frequencies.
DRI: @stomlinson
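As a hedged illustration of what "WAL rate" refers to (our real monitoring uses dedicated exporters rather than a console script), the primary's WAL position can be sampled over an interval to estimate bytes generated per second:

```ruby
# Illustrative only: estimate WAL bytes generated per second on the primary
# by sampling pg_current_wal_lsn() twice. The interval and output are made up.
def current_wal_lsn
  ApplicationRecord.connection.select_value('SELECT pg_current_wal_lsn()')
end

start_lsn = current_wal_lsn
sleep 10
end_lsn = current_wal_lsn

bytes = ApplicationRecord.connection.select_value(
  "SELECT pg_wal_lsn_diff('#{end_lsn}', '#{start_lsn}')"
).to_f

puts "Approximate WAL rate: #{(bytes / 10 / 1024 / 1024).round(2)} MiB/s"
```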
Secondary Focus Items
Migrations should run in milestone, then type, ... (gitlab-org&10411)
DRI: @jon_jenkins
What: We aim to enrich database migrations by tagging them with the milestone they belong to. This will let us improve how we order migrations when executing them, rather than relying on the migration's timestamp (version), which is somewhat arbitrary and can be misleading.
Why: This will improve the upgrade experience for self-managed customers that jump multiple milestones at a time: if they hit an error when executing migrations, we will be able to tell which GitLab release their schema is compatible with.
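For illustration, the developer-facing tagging syntax looks roughly like the following. The class name, table, column, base class version, and milestone value are all invented for this example; the linked merge request below has the actual implementation.

```ruby
# Illustrative migration showing the milestone tagging idea described above.
# The details (base class version, DSL, table, column) are assumptions made
# for this sketch, not a copy of the real code.
class AddExampleColumnToProjects < Gitlab::Database::Migration[2.2]
  milestone '16.9'

  def change
    add_column :projects, :example_setting, :boolean, default: false, null: false
  end
end
```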
Update
Work is continuing: gitlab-org/gitlab!137190 (merged). For posterity, we did this spike as a proof of concept.
What's complete:
- Milestone tagging for migration classes using developer-friendly syntax
- Spec testing on custom version and milestone objects
- We enforce this through migration class inheritance
- Code to read current milestone
What is pending release:
- Actual migration ordering code
- Rolling back
- `db:migrate:status` displays migrations in the correct order, along with their milestone and migration type
What needs to be done:
- Final debugging and testing before release
Load Balancer Improvements
DRI: @mattkasa
What: We're making some changes to the load balancer to support more efficient and zone-aware traffic routing. To accomplish this, we're also making necessary improvements to the underlying classes.
Why: As a critical part of our database layer, the load balancer must be stable and reliable. Adding zone-aware database routing will reduce network request time and cost by picking replicas in the same zone first.
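As a hedged sketch of the routing idea only (the class and method names below are hypothetical, not our actual load balancer code), zone-aware selection amounts to preferring healthy replicas in the caller's zone and falling back to any healthy replica:

```ruby
# Hypothetical sketch of zone-aware replica selection. Replica, #zone, and
# #healthy? are invented for this example; they stand in for whatever the
# load balancer actually tracks about each replica host.
class ZoneAwareReplicaPool
  def initialize(replicas, local_zone:)
    @replicas = replicas
    @local_zone = local_zone
  end

  # Prefer a healthy replica in the same zone to avoid cross-zone traffic;
  # if none is available, fall back to any healthy replica.
  def pick
    healthy = @replicas.select(&:healthy?)
    local = healthy.select { |replica| replica.zone == @local_zone }

    candidates = local.empty? ? healthy : local
    candidates.sample
  end
end
```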