Database Team - 16.7 Planning
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Capacity
This milestone, the Database group is operating at full capacity. For a list of upcoming absences, please refer to our weekly status update. Please also keep in mind that roughly half of the team's capacity is typically consumed by unplanned work.
Boards
Planning
We are maintaining focus on the initiatives that most affect the availability and reliability of GitLab.com and self-managed instances:
- Our high-level focus is a multi-pronged database scaling strategy, summarized here: https://gitlab.com/gitlab-org/gitlab/-/issues/397121
- We are additionally concerned about the potential risk of lightweight lock contention on our primary DB. This may impact DB availability during periods of high traffic and may be exacerbated by our partitioning efforts. We are looking to strike a balance between faster access to data (partitioning) and high availability (mitigating lock contention). We believe our query testing efforts will help us walk this fine line.
We have one FTE engineer assigned as a stable counterpart to support datastore solutions for AI-related initiatives.
16.7
Top Priorities for Reducing Lightweight Lock Contention
Based on our estimates in https://gitlab.com/groups/gitlab-org/-/epics/11639, we will brainstorm and prioritize opportunities to further improve database efficiency. Our target is to ensure database scalability until Tenant Cells is ready.
We are currently focused on reducing lightweight lock contention, our biggest limiting factor. This workstream is closely coupled with our table size reduction efforts (described below).
DRI: @krasio
Table size reduction effort
Verify DRI: @mattkasa
Create DRI: @dfrazao-gitlab
While reducing lightweight lock contention is the team's primary concern (and therefore its primary focus), we are still working with devops::create and devops::verify to partition some of the largest tables in our database. @stomlinson is stepping back from these topics to focus on the WAL rate investigation.
Related items:
WAL Rate Reduction
What: The WAL rate (the rate at which write-ahead log data is generated by the primary Postgres database) is an urgent contention point for GitLab.com. @msmiley has an excellent write-up in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2597.
Why: Without reducing the WAL generation rate (or possibly increasing our capacity to apply WAL to replica databases), replicas will eventually fall behind and stop serving requests, leading to a site outage. It is critical that we avoid this scenario.
We plan to tackle this in a few different ways:
- Provide mechanisms to tolerate replication lag:
  - 1.1 Add a way to run feature flags in the load balancer.
  - 1.2 Add a feature flag to ignore replication lag, keeping the site running if the alternative is an outage.
- Reduce WAL generation:
  - 2.1 Identify which tables are generating the most WAL volume (via pg_waldump and the pg_stat_statements WAL metrics).
  - 2.2 Work with the owning teams to reduce it.
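A minimal sketch of how item 1.2 could look in load balancer code. The `LoadBalancer` class, the `ignore_replication_lag` flag name, and the lag threshold are all illustrative assumptions, not GitLab's actual load balancer API:

```ruby
# Hypothetical sketch: class, flag name, and threshold are illustrative
# assumptions, not the real GitLab load balancer implementation.
class LoadBalancer
  MAX_REPLICATION_LAG_SECONDS = 60

  def initialize(flags: {})
    @flags = flags # e.g. { ignore_replication_lag: true }
  end

  # Decide whether a replica may serve reads given its current lag.
  def replica_usable?(lag_seconds)
    # Flag 1.2: if enabled, keep serving from lagging replicas rather
    # than taking the site down when all replicas fall behind.
    return true if @flags[:ignore_replication_lag]

    lag_seconds <= MAX_REPLICATION_LAG_SECONDS
  end
end
```

With the flag off, a replica lagging beyond the threshold is taken out of rotation; with the flag on, it keeps serving stale reads, trading consistency for availability.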
In %16.7 we concentrate on the first step of each approach: identifying what is causing our WAL rate (item 2.1), and laying a foundation for reactive handling during an incident (item 1.1).
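For item 2.1, the per-statement WAL counters in pg_stat_statements (available since PostgreSQL 13) are one plausible starting point; a sketch of such a query, assuming the extension is installed:

```sql
-- Sketch: top WAL-generating statements (PostgreSQL 13+ with
-- pg_stat_statements enabled); column names are from the extension.
SELECT queryid,
       left(query, 60) AS query_sample,
       wal_records,
       wal_bytes
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;
```

Attributing those statements to tables (and then teams) would still require cross-referencing with pg_waldump output or the statement text itself.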
DRI: @stomlinson
Secondary Focus Items
Migrations should run in milestone, then type, ... (gitlab-org&10411)
DRI: @jon_jenkins
What: We aim to enrich database migrations by tagging them with the milestone they belong to. This will let us improve how we order migrations when executing them, rather than relying on the migration's timestamp (version), which is somewhat arbitrary and can be misleading.
Why: This will improve the upgrade experience for self-managed customers that jump multiple milestones at a time: if they hit an error when executing migrations, we will be able to tell which GitLab release their schema is compatible with.
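A self-contained sketch of the idea, assuming a hypothetical `BaseMigration` DSL (the class names and `milestone` helper are illustrative, not the final GitLab implementation):

```ruby
# Hypothetical sketch: a class-level `milestone` DSL so each migration
# declares the release it belongs to; names are illustrative only.
class BaseMigration
  class << self
    def milestone(version = nil)
      @milestone = version if version
      @milestone
    end
  end
end

class AddIndexToUsers < BaseMigration
  milestone '16.7' # tag the migration with its release milestone
end

# Ordering sketch: sort by (milestone, timestamp) instead of timestamp
# alone, so a 16.6 migration with a later timestamp still runs first.
migrations = [
  { milestone: '16.7', version: 20231101000000 },
  { milestone: '16.6', version: 20231115000000 },
]
ordered = migrations.sort_by { |m| [Gem::Version.new(m[:milestone]), m[:version]] }
```

The key point is the sort key: milestone first, timestamp second, which is what makes timestamp collisions across releases harmless.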
Update
Work is continuing in gitlab-org/gitlab!128144 (closed). For posterity, this spike served as a proof of concept.
What's complete:
- Milestone tagging for migration classes using developer-friendly syntax
- Spec testing on custom version and milestone objects
- Danger job to enforce tagging of version 2.2 migrations (we enforce this through migration class inheritance)
- Code to read the current milestone
What needs to be done:
- Spec testing on relevant ActiveRecord overrides (spec testing is only needed on the new rake task overrides; we figured out how to minimize ActiveRecord overrides)
- Mechanism to roll back to a given milestone (we will focus on this in a future iteration)
- Final implementation of the new ordering
The above will be delivered in a single forthcoming MR.