Database Group - 16.5 Planning
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Capacity
In %16.5, the Database group is operating at full capacity. For a list of upcoming absences, please refer to our weekly status update. Please also keep in mind that about half of the team's capacity is typically consumed by unplanned work.
Boards
Planning
We are maintaining focus on the initiatives that most affect the availability and reliability of GitLab.com and self-managed instances:
- Our high-level focus is a multi-pronged database scaling strategy summarized here: https://gitlab.com/gitlab-org/gitlab/-/issues/397121
- We are additionally concerned by the potential risk of lightweight lock contention on our primary DB. This may impact DB availability during times of high traffic and may be exacerbated by our partitioning efforts. We are looking to strike a balance between faster access to data (partitioning) and high availability (mitigating lock contention). We believe our query testing efforts will help walk this fine line.
We have one FTE engineer assigned as a stable counterpart to support datastore solutions for AI-related initiatives.
16.5
Top Priorities for Post-Upgrade PG14 Support
DRI: @stomlinson
While previous upgrades have gone well, we're setting aside some capacity to ensure we have folks available to fix any urgent issues that may arise as a result of the PG14 upgrade on 2023-09-09. We spent considerable time on this last milestone (about 2 team weeks). We don't anticipate further issues, but it is important to recognize that they may arise and require capacity.
We will also use this capacity to do further investigations into our load balancing code to correct some other non-blocking issues that were identified after the previous upgrade attempt. We believe these investigations can improve database performance for all GitLab customers.
Related Item: PG14 Upgrade CR
Table size reduction effort
DRI: @stomlinson
Partitioning efforts across GitLab's main and CI databases are the Database group's top priority for ensuring GitLab continues to scale in a performant way. These are infradev-prioritized issues, and we are working closely with individual groups to tackle some of GitLab's largest tables. We expect this effort to yield meaningful performance gains since vacuum pressure caused by large table sizes is one of our biggest bottlenecks.
Update: The database team has shifted to understanding and improving the situation of lightweight lock (LWLock) contention, as partitioning could significantly exacerbate this problem and pose an availability risk for GitLab.com.
In %16.5 we will continue our investigation into lightweight lock contention and attempt to alleviate it by removing a few unused indexes from the namespaces table.
In %16.4 we determined a lightweight lock acquisition rate for each database based on snapshots of pg_stat_statements, but this analysis does not consider locks already taken due to open transactions. We expect this inaccuracy to be negligible, but we need to verify the analysis. In %16.5, we plan to do so by sampling the lightweight lock rate from a production database during a low-traffic period and comparing it to the rate from the analysis already performed.
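The comparison described above can be sketched as a simple relative-deviation check. This is only an illustration: the function names and the 10% tolerance are assumptions made for the example, not values the team has chosen, and the rates would come from the pg_stat_statements analysis and the production sample respectively.

```python
def relative_deviation(estimated_rate: float, sampled_rate: float) -> float:
    """Return |estimated - sampled| / sampled for two lock acquisition rates."""
    if sampled_rate <= 0:
        raise ValueError("sampled_rate must be positive")
    return abs(estimated_rate - sampled_rate) / sampled_rate


def analysis_matches_production(
    estimated_rate: float,
    sampled_rate: float,
    tolerance: float = 0.10,  # hypothetical threshold, chosen only for illustration
) -> bool:
    """True when the estimate is within `tolerance` of the sampled rate."""
    return relative_deviation(estimated_rate, sampled_rate) <= tolerance
```

Both rates need to be expressed in the same unit (for example, acquisitions per second) for the comparison to be meaningful.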
We're also working with the Source Code group to implement new helpers to assist them in partitioning the merge_request_diff_* tables (gitlab-org/gitlab#423780 (closed)).
Related items: CI Partitioning Support
Update Primary Keys to BigInt
DRI: @krasio
The database team has made good progress towards mitigating Primary Key exhaustion: https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/548. All work is done for GitLab.com, and we are almost done for self-managed:
- GitLab.com - all done
- Self-managed:
  - For all tables but sent_notifications, the MRs for swapping columns are merged in 16.4 - gitlab-org/gitlab#417402 (closed).
    - In 16.5 we can start merging the clean-up MRs
  - For sent_notifications, we are taking a slower approach in order to avoid issues during upgrade, like those we had with ci_build_needs. The MR to fix the missing column/back-fill migration is ready to be merged - gitlab-org/gitlab!126763 (merged). Once we have it shipped we can move forward with this table.
Automated Database Testing
DRI: @mattkasa
What: We're extracting queries and identifying added or changed queries from a merge request. Long term, we will provide analysis on these extracted queries to ensure they meet our guidelines.
Why: We're hoping to help people identify their new queries early and give database reviewers a boost by extracting the SQL automatically. Additionally, this should aid partitioning efforts by helping teams identify queries that do not contain the partitioning key for their newly partitioned tables.
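As a rough illustration of the partitioning check mentioned above, the sketch below flags extracted queries that reference a partitioned table without mentioning its partitioning key. It is a minimal sketch under stated assumptions: the table-to-key mapping and the function name are hypothetical, and the check is purely textual rather than a real SQL parse, so a key that appears only in the SELECT list would not be flagged.

```python
import re

# Hypothetical mapping of partitioned table -> partitioning key column.
PARTITION_KEYS = {"ci_builds": "partition_id"}


def queries_missing_partition_key(queries, partition_keys=PARTITION_KEYS):
    """Return (table, sql) pairs where sql names the table but not its key."""
    flagged = []
    for sql in queries:
        lowered = sql.lower()
        for table, key in partition_keys.items():
            # Word boundaries keep ci_builds from matching ci_builds_metadata.
            references_table = re.search(rf"\b{re.escape(table)}\b", lowered)
            mentions_key = re.search(rf"\b{re.escape(key)}\b", lowered)
            if references_table and not mentions_key:
                flagged.append((table, sql))
    return flagged
```

A production version would more likely inspect the parsed query or its plan to confirm the key is actually used as a filter.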
Progress update: This was a bit waylaid last milestone due to the focus on the PG14 upgrade. This milestone, we hope to deliver a single report which only includes newly added queries.
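One way to picture a report that only includes newly added queries is a fingerprint comparison between the queries seen on the target branch and those seen in the merge request. The sketch below is an assumption about how this could work, not the team's implementation: `normalize` and `newly_added_queries` are hypothetical names, and the normalization is deliberately naive (it replaces literals with `?` and collapses whitespace).

```python
import re


def normalize(sql: str) -> str:
    """Reduce a query to a fingerprint by masking literals and whitespace."""
    sql = re.sub(r"'(?:[^']|'')*'", "?", sql)  # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)         # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def newly_added_queries(baseline, candidate):
    """Return candidate queries whose fingerprint is absent from baseline."""
    known = {normalize(q) for q in baseline}
    added, seen = [], set()
    for query in candidate:
        fingerprint = normalize(query)
        if fingerprint not in known and fingerprint not in seen:
            seen.add(fingerprint)
            added.append(query)  # keep the original text for the report
    return added
```

Queries that differ only in literal values or spacing share a fingerprint, so they are reported once or not at all.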
Focus Items
Migrations should run in milestone, then type, ... (gitlab-org&10411)
DRI: @krasio
What: We aim to enrich database migrations and tag them with the milestone they belong to. This will let us improve how we order migrations when executing them, rather than relying on a migration's timestamp (version), which is a bit random and can be misleading.
Why: This will improve the upgrade experience for self-managed customers that jump multiple milestones at a time. If they hit an error when executing migrations, we will be able to tell which GitLab release their schema is compatible with.
Update
Work has started on this initiative: gitlab-org/gitlab!128144 (closed). For posterity, we did this spike as a proof-of-concept.
What's pending completion (open MR):
- Milestone tagging for migration classes using developer-friendly syntax
- Spec testing on custom version and milestone objects
- Code to read current milestone
What's conceptually done (by "conceptually done" we mean we have the code, but still need to split it into smaller chunks in order to get it through review and merge):
- Custom sort order that considers legacy migrations first, then sorts by milestone, then timestamp, then migration type, with post migrations sorted after regular ones
- Modifications to db:migrate to ensure that:
  - Migrations are applied in the proper order and filtering up to and including a passed milestone is available
  - Custom sort order is applied for multi-DB configurations
- db:migrate:status now displays a migration's milestone
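To make the ordering and filtering described in this list concrete, here is a minimal sketch. It is not the actual implementation: the `Migration` fields and function names are hypothetical, and the key follows the order as listed above (legacy first, then milestone, then timestamp, then type). Reordering the tuple elements would change the precedence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Migration:
    version: int              # timestamp-based version, e.g. 20230905000000
    milestone: Optional[str]  # e.g. "16.5"; None for legacy (untagged) migrations
    post: bool = False        # True for post-deployment migrations


def milestone_key(milestone: str):
    """Compare milestones numerically so that 16.10 sorts after 16.5."""
    return tuple(int(part) for part in milestone.split("."))


def sort_key(migration: Migration):
    if migration.milestone is None:
        # Legacy migrations come first, ordered by timestamp, then type.
        return (0, (), migration.version, migration.post)
    return (1, milestone_key(migration.milestone), migration.version, migration.post)


def ordered(migrations, up_to_milestone: Optional[str] = None):
    """Sort migrations, optionally keeping only those up to a milestone (inclusive)."""
    selected = [
        m for m in migrations
        if up_to_milestone is None
        or m.milestone is None
        or milestone_key(m.milestone) <= milestone_key(up_to_milestone)
    ]
    return sorted(selected, key=sort_key)
```

Parsing the milestone into integers matters because a plain string comparison would place "16.10" before "16.5".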
What needs to be done:
- Spec testing on relevant ActiveRecord overrides
- Danger job to enforce tagging of version 2.2 migrations
- Mechanism to roll back to a given milestone