# Schema management for isolating tables iteratively
This is a follow-up from !64289 (comment 612972450):
Let me summarize what I understand we're doing here (and please correct what I missed), and detail an alternative solution that does not require substantial changes to the migration methodology, keeps things simple in that regard (there's only one way of doing database migrations), and also helps us with self-managed.
I would like to ask to consider this proposal before we decide upon the MR at hand.
It also has implications beyond schema management; I don't know what you've already discussed in full there, but based on a great discussion with @iroussos I believe this proposal fits our needs in general.
I'll be using `database:ci` and `database:main` to talk about separate logical databases and connections to those databases.
## Goals
> This will be the easiest way to keep the schemas in sync since they will be the same schema. These schemas won't need to diverge until %15.0 and at that point we can create the different directory for migrations. ... GitLab.com the database schema will be the same anyway as we're doing it via streaming replication of the current main database so this avoids a weird divergence in production
1. Keep exactly the same database schema between `database:main` and `database:ci`
2. Isolate tables: In this iteration, `ci_instance_variables` is identified as "belongs to CI" and should be isolated. More CI tables will be isolated like this.
3. Discovery: When looking at the `database:ci` database, I want to only see "belongs to CI" tables
4. Disallow cross-joins: When connected to `database:ci`, we want to error out hard when queries join across tables that "belong to CI" and tables that don't.
## Solution: Managed `search_path`
This can all be done without introducing an additional database schema and an additional migration methodology for the `database:ci` database:

1. We always deploy the same schema to `database:ci` and `database:main` (this is goal (1) after all) - keep doing what we already do, using `db/structure.sql` and migrations as the SSOT for that
2. We create an additional Postgres schema (`CREATE SCHEMA gitlab_ci`) for CI
3. For a connection to `database:main`, we add `gitlab_ci` to the search path: `search_path = public, gitlab_ci, ...`
4. For a connection to `database:ci`, we limit the search path to `search_path = gitlab_ci` (goal (3) - limited discovery + goal (4) - disallow cross-joins)
5. We add a migration that moves `ci_instance_variables` to the `gitlab_ci` schema (this is a metadata-only operation: `ALTER TABLE ... SET SCHEMA gitlab_ci`) (goal (2) - isolate table)
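Steps (2) and (5) boil down to two statements; a minimal sketch, assuming Postgres and the table/schema names used above:

```sql
-- Step (2): create the dedicated schema for CI tables (one-off)
CREATE SCHEMA IF NOT EXISTS gitlab_ci;

-- Step (5): move the table into that schema. This is a catalog
-- metadata change only - no table data is rewritten or copied -
-- so it is fast regardless of table size.
ALTER TABLE ci_instance_variables SET SCHEMA gitlab_ci;
```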
Managing the `search_path` is supported by Rails (it is part of the connection config).
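For illustration, this is roughly what the connection config could look like via the Rails PostgreSQL adapter's `schema_search_path` setting (the connection names `main`/`ci` and the database name here are assumptions, not our actual config):

```yaml
# config/database.yml - illustrative sketch only
production:
  main:
    adapter: postgresql
    database: gitlabhq_production
    # main sees everything: public plus the CI schema
    schema_search_path: "public,gitlab_ci"
  ci:
    adapter: postgresql
    database: gitlabhq_production  # same physical database for now
    # ci only sees the gitlab_ci schema
    schema_search_path: "gitlab_ci"
```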
Expanding on goal (4) - disallow cross-joins: A connection to `database:ci` will only "see" what's in the `gitlab_ci` schema. An attempted `SELECT * FROM ci_instance_variables JOIN other_table_in_public` will error out hard, because the joined table is not visible.
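Concretely, this is what that failure looks like in a Postgres session (`other_table_in_public` is a placeholder as above, and the join condition is hypothetical):

```sql
-- Simulate the database:ci connection's restricted view
SET search_path TO gitlab_ci;

-- ci_instance_variables resolves (it lives in gitlab_ci), but the
-- unqualified reference to a table left in public cannot resolve:
SELECT *
FROM ci_instance_variables
JOIN other_table_in_public USING (id);
-- ERROR:  relation "other_table_in_public" does not exist
```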
At a later stage, when we actually want the `database:ci` database schema to diverge, we introduce separate schema management for `ci`. When we have that, we can simply drop the `public` schema (with everything in it). Given the implications for self-managed below, we may not want to do this step (we don't need it - there's no harm in having a few empty tables around).
## How to iterate on extracting a group of tables, e.g. `ci_*`

In this summary, we aim to extract CI-related tables and change the application to remove cross-joins to unrelated tables. The goal is to be able to extract this set of tables into an entirely different database.
Steps:
1. Create a schema inside `db:main` for the group, e.g. `gitlab_ci`
2. Move relevant tables into this schema - transparent for the application (append to `search_path`)
3. Identify and remove cross-joins to adapt the application
4. Once compatible, use a `db:ci` connection with a `search_path` limited to `gitlab_ci` to have an isolated view of the database for CI
5. Once this is proven to work, tables in `gitlab_ci` can actually be taken out of `db:main` and moved into `db:ci` (if desired, e.g. on GitLab.com). No application changes are needed apart from changing the database connection config (different database).
All this can be done and shipped today, to both GitLab.com and self-managed. There is no need for infrastructure or Omnibus changes to support this.
## Discussion
Benefits:
- All goals achieved ✅ - a soft separation of tables that is easy to deal with
- We can start very quickly: start with step (1) of the solution above and you'll already have `ci_instance_variables` available to you in `database:ci` (albeit not isolated - this can be done in later steps)
- Ability to selectively validate isolation changes, avoiding big-bang changes. See #334690 (comment 614694876).
- No dependency on or support needed from infrastructure and Omnibus changes to start isolating application logic and tables (can all be deployed to GitLab.com/self-managed today)
- No change at all to how we manage the database schema (includes post-deploy/background migrations)
- Transparent to anybody else outside the "sharding" effort
- Self-managed: can ship to self-managed just with regular migrations - no special treatment needed
- Self-managed: huge plus (see below) - no need for separate databases unless desired
- Logical backups (`pg_dump`) continue to produce a consistent dump if there is only one database
- We don't necessarily need #334227 (closed) to implement schema comparison checking - the schema will be based on the same migrations and hence be equivalent by design.
Weaknesses:
- We know how to work with different Postgres schemas (we already do), but not all our custom migration helpers support this completely (for group::database to check and make sure this works). No urgency here, since we don't expect to do a lot of migrations on `ci_instance_variables` soon.
- If we wanted to have a different schema "on the other side" (in `database:ci`) to e.g. incorporate partitioning changes before we migrate data, this would need some work. It can be dealt with as needed when we design the data migration parts later (it's not needed today).
## Self-managed vs. GitLab.com
This is a huge plus for self-managed: installations will only need one database, and they don't need to migrate data to a separate database unless that is needed (for scaling purposes). A standard self-managed install keeps having only one database, but with a few schemas: `public, gitlab_ci, gitlab_AAA, gitlab_BBB, ...`. This can even result in using the same connection pool for the different roles (configuring the `search_path` so that all tables are visible, or managing it upon "switching" connections) instead of having one connection pool per role (which multiplies connections), reducing resource needs.
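A sketch of the "switching" variant, assuming the connection layer adjusts the search path whenever a pooled connection changes roles:

```sql
-- Connection checked out for the "ci" role: narrow the view
SET search_path TO gitlab_ci;
-- ... run CI queries against an isolated view ...

-- Same connection reused for the "main" role: widen the view again
SET search_path TO public, gitlab_ci;
-- ... all tables are visible once more ...
```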
Another benefit is that backups (pg_dump) continue to work as-is for installations that keep having one database only. With multiple databases, taking a consistent snapshot is more involved.
For GitLab.com (or another large install), we'll break the database apart and extract each Postgres schema into its own database. At this point, we already know that the relevant parts of the app (e.g. CI) work fine when they only "see" their own schema, e.g. only CI tables. So this step is a matter of designing the data migration only (we don't have to deal with app changes anymore).
This means we can also validate "isolation changes" (making sure that the app works when it doesn't see all tables anymore, but only specific ones) quickly and iteratively - we can ship it to GitLab.com without having to have a separate database already (with all the mechanics this needs) and it easily ships to self-managed, too.