Database Team - 16.7 Planning
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Capacity
This milestone, the Database group is operating at full capacity. For a list of upcoming absences, please refer to our weekly status update. Please also keep in mind that roughly half of the team's capacity is typically consumed by unplanned work.
Boards
Planning
We are maintaining focus on the initiatives that most affect the availability and reliability of GitLab.com and self-managed instances:
- Our high-level focus is a multi-pronged database scaling strategy, summarized here: https://gitlab.com/gitlab-org/gitlab/-/issues/397121
- We are additionally concerned about the potential risk of lightweight lock contention on our primary DB. This may impact DB availability during periods of high traffic and may be exacerbated by our partitioning efforts. We are looking to strike a balance between faster access to data (partitioning) and high availability (mitigating lock contention). We believe our query testing efforts will help us walk this fine line.
We have one FTE engineer assigned as a stable counterpart to support datastore solutions for AI-related initiatives.
16.7
Top Priorities for Reducing Lightweight Lock Contention
Based on our estimates in https://gitlab.com/groups/gitlab-org/-/epics/11639, we will brainstorm and prioritize opportunities to further improve database efficiency. Our target is to ensure database scalability until Tenant Cells is ready.
We are currently focused on reducing lightweight lock contention, our biggest limiting factor. This workstream is closely coupled with our table size reduction efforts (described below).
DRI: @krasio
Table size reduction effort
Verify DRI: @mattkasa
Create DRI: @dfrazao-gitlab
While reducing lightweight lock contention is the team's primary concern (and therefore its primary focus), we are still working with devops::create and devops::verify to partition some of the largest tables in our database. @stomlinson is stepping back from these topics to focus on the WAL rate investigation.
Related items:
WAL Rate Reduction
What: The WAL rate (the rate at which write-ahead log data is generated by the primary Postgres database) is an urgent contention point for GitLab.com. @msmiley has an excellent write-up in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2597.
Why: Without reducing the WAL generation rate (or possibly increasing our capacity to apply WAL to replica databases), replicas will eventually fall behind and stop serving requests, leading to a site outage. It is critical that we avoid this scenario.
We plan to tackle this in a few different ways:
- Provide mechanisms to tolerate replication lag:
  - 1.1 Add a way to run feature flags in the load balancer.
  - 1.2 Add a feature flag to ignore replication lag, keeping the site running if the alternative is an outage.
- Reduce WAL generation:
  - 2.1 Identify which tables are generating the most WAL volume (via pg_waldump and the pg_stat_statements WAL metrics).
  - 2.2 Work with the owning teams to reduce it.
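A minimal sketch of how item 1.2 could look in load balancer code. The `LoadBalancer` class, the `ignore_replication_lag` flag name, and the lag threshold are all illustrative assumptions, not GitLab's actual load balancer API:

```ruby
# Hypothetical sketch: class, flag name, and threshold are illustrative
# assumptions, not the real GitLab load balancer implementation.
class LoadBalancer
  MAX_REPLICATION_LAG_SECONDS = 60

  def initialize(flags: {})
    @flags = flags # e.g. { ignore_replication_lag: true }
  end

  # Decide whether a replica may serve reads given its current lag.
  def replica_usable?(lag_seconds)
    # Flag 1.2: if enabled, keep serving from lagging replicas rather
    # than taking the site down when all replicas fall behind.
    return true if @flags[:ignore_replication_lag]

    lag_seconds <= MAX_REPLICATION_LAG_SECONDS
  end
end
```

With the flag off, a replica lagging beyond the threshold is taken out of rotation; with the flag on, it keeps serving stale reads, trading consistency for availability.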
In %16.7 we concentrate on the first step of each approach: identifying what is causing our WAL rate (item 2.1), and laying a foundation for reactive handling during an incident (item 1.1).
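For item 2.1, the per-statement WAL counters in pg_stat_statements (available since PostgreSQL 13) are one plausible starting point; a sketch of such a query, assuming the extension is installed:

```sql
-- Sketch: top WAL-generating statements (PostgreSQL 13+ with
-- pg_stat_statements enabled); column names are from the extension.
SELECT queryid,
       left(query, 60) AS query_sample,
       wal_records,
       wal_bytes
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;
```

Attributing those statements to tables (and then teams) would still require cross-referencing with pg_waldump output or the statement text itself.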
DRI: @stomlinson
Secondary Focus Items
Migrations should run in milestone, then type, ... (gitlab-org&10411)
DRI: @jon_jenkins
What: We aim to enrich database migrations by tagging them with the milestone they belong to. This will let us improve how we order migrations when executing them, rather than relying on the migration's timestamp (version), which is somewhat arbitrary and can be misleading.
Why: This will improve the upgrade experience for self-managed customers that jump multiple milestones at a time: if they hit an error when executing migrations, we will be able to tell which GitLab release their schema is compatible with.
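A self-contained sketch of the idea, assuming a hypothetical `BaseMigration` DSL (the class names and `milestone` helper are illustrative, not the final GitLab implementation):

```ruby
# Hypothetical sketch: a class-level `milestone` DSL so each migration
# declares the release it belongs to; names are illustrative only.
class BaseMigration
  class << self
    def milestone(version = nil)
      @milestone = version if version
      @milestone
    end
  end
end

class AddIndexToUsers < BaseMigration
  milestone '16.7' # tag the migration with its release milestone
end

# Ordering sketch: sort by (milestone, timestamp) instead of timestamp
# alone, so a 16.6 migration with a later timestamp still runs first.
migrations = [
  { milestone: '16.7', version: 20231101000000 },
  { milestone: '16.6', version: 20231115000000 },
]
ordered = migrations.sort_by { |m| [Gem::Version.new(m[:milestone]), m[:version]] }
```

The key point is the sort key: milestone first, timestamp second, which is what makes timestamp collisions across releases harmless.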
Update
Work is continuing in gitlab-org/gitlab!128144 (closed). For posterity, this spike served as a proof of concept.
What's complete:
- Milestone tagging for migration classes using developer-friendly syntax
- Spec testing on custom version and milestone objects
- Danger job to enforce tagging of version 2.2 migrations (we enforce this through migration class inheritance)
- Code to read the current milestone
What needs to be done:
- Spec testing on relevant ActiveRecord overrides (spec testing is only needed on the new rake task overrides; we figured out how to minimize ActiveRecord overrides)
- Mechanism to roll back to a given milestone (we will focus on this in a future iteration)
- Final implementation of the new ordering
The above will be delivered in a single forthcoming MR.