Skip to content

Discussion/review of the database schema and migrations

Context

Please see the issue description at #104 (closed). The following issues are also relevant:

  • Estimated database size requirements: Detailed description and metrics about the database populated with the dev.gitlab.org container registry metadata (the size of the database, tables, indexes, row count, etc.). This was used to estimate the size of a database for the gitlab.com registry.

  • Estimated database query rate: Metrics about the gitlab.com container registry HTTP API and an estimate of the query rate (broken down by SELECT, INSERT, etc.) for a database serving the gitlab.com container registry.

Purpose

This MR is only for demonstration purposes, making it easier to review and discuss the registry database schema and migrations. This is not going to be merged. We have selected only the relevant files (those with database queries or schema migrations) to reduce noise (thus the failing pipeline).

If interested, you can find the full source code in the master branch. The diff between master and release/2.9-gitlab (which contains our latest stable releases) shows all database related changes.

Structure

In the changes diff you'll find the following structure and files:

  • /db/er_model.png: This is the Entity-Relationship model of the database schema.

  • /migrations/*.go: These are the migration files that contain the database DDL. Internally, the registry manages migrations with the github.com/rubenv/sql-migrate tool (same as Gitaly Praefect). A CLI surfaces all of the admin functionality (documented here).

  • /registry/datastore/*.go: These are the Go files where we can find the SQL queries being used. The registry doesn't use any ORM, just the database/sql package from the Go standard library, a PostgreSQL driver (github.com/lib/pq) and raw SQL queries (so these are easy to locate and review, skipping non-relevant portions of the code if wanting to, like utility functions). The documentation for each function that has a SQL query is valuable to understand its purpose.

Where to Start

A review should start with the database ER model and the attached comment (link), which provides an overview of the schema. This is required to understand the concepts behind the design choices.

After that, it should be OK to move to the Changes tab and scroll through the migrations and then the SQL queries.

We have attached a query plan to each SQL query. We have populated a database (in GCP) with the dev.gitlab.org container registry metadata and used it to extract these (let us know if you need access). Along with each query, we detail how often we expect it to be used (load). This is done with the following comments for easy discoverability:

  • This is rarely used 🟢
  • This is heavily used 🔴
  • This is only used for tests 🟢 🧪
  • This is frequently used 🟡
  • This is frequently used (background jobs) 🟡
  • This is infrequently used 🔵
  • This is infrequently used (background jobs) 🔵

Notes

  • We're looking forward to receiving feedback in general and recommendations/concerns around the database schema and the DDL;
  • We're especially interested in the performance and scalability of the heavily and frequently used queries. They seem to perform well in general;
  • Although most of the queries listed here are only used for test purposes, we still included them here as they might provide valuable context/insight. These threads have been resolved to avoid too much noise from these;
  • We're curious about possible partitioning strategies and scalability in general (this might deserve a separate issue). This is the right time to make changes if we need to.

Related to #104 (closed)

Edited by João Pereira

Merge request reports