# Schema management for isolating tables iteratively
This is a follow-up from !64289 (comment 612972450):
Let me summarize what I understand we're doing here (and please correct what I missed), and detail an alternative solution that does not require substantial changes to the migration methodology, keeps things simple in that regard (there's only one way of doing database migrations), and also helps us with self-managed.
I would like to ask to consider this proposal before we decide upon the MR at hand.
It also has implications beyond schema management; I don't know what you've already discussed in full there, but based on a great discussion with @iroussos I believe this proposal fits our needs in general.
I'll be using `database:ci` and `database:main` to talk about separate logical databases and connections to those databases.
## Goals
> This will be the easiest way to keep the schemas in sync since they will be the same schema. These schemas won't need to diverge until %15.0 and at that point we can create the different directory for migrations. ... GitLab.com the database schema will be the same anyway as we're doing it via streaming replication of the current main database so this avoids a weird divergence in production
1. Keep exactly the same database schema between `database:main` and `database:ci`
2. Isolate tables: In this iteration, `ci_instance_variables` is identified as "belongs to CI" and should be isolated. More CI tables will be isolated like this.
3. Discovery: When looking at the `database:ci` database, I want to only see "belongs to CI" tables
4. Disallow cross-joins: When connected to `database:ci`, we want to error out hard when queries join across tables that "belong to CI" and tables that don't.
## Solution: Managed `search_path`
This can all be done without introducing an additional database schema and an additional migration methodology for the `database:ci` database:

1. We always deploy the same schema to `database:ci` and `database:main` (this is goal (1) after all) - keep doing what we already do, using `db/structure.sql` and migrations as the SSOT for that
2. We create an additional Postgres schema (`CREATE SCHEMA gitlab_ci`) for CI
3. For a connection to `database:main`, we add `gitlab_ci` to the search path: `search_path = public, gitlab_ci, ...`
4. For a connection to `database:ci`, we limit the search path to `search_path = gitlab_ci` (goal (3) - limited discovery + goal (4) - disallow cross-joins)
5. We add a migration that moves `ci_instance_variables` to the `gitlab_ci` schema (this is a metadata-only operation: `ALTER TABLE ... SET SCHEMA gitlab_ci`) (goal (2) - isolate table)
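Steps (2) and (5) boil down to two statements; a minimal sketch, assuming Postgres and the table/schema names used above:

```sql
-- Step (2): create the dedicated schema for CI tables (one-off)
CREATE SCHEMA IF NOT EXISTS gitlab_ci;

-- Step (5): move the table into that schema. This is a catalog
-- metadata change only - no table data is rewritten or copied -
-- so it is fast regardless of table size.
ALTER TABLE ci_instance_variables SET SCHEMA gitlab_ci;
```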
Managing the `search_path` is supported by Rails (it is part of the connection config).
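For illustration, this is roughly what the connection config could look like via the Rails PostgreSQL adapter's `schema_search_path` setting (the connection names `main`/`ci` and the database name here are assumptions, not our actual config):

```yaml
# config/database.yml - illustrative sketch only
production:
  main:
    adapter: postgresql
    database: gitlabhq_production
    # main sees everything: public plus the CI schema
    schema_search_path: "public,gitlab_ci"
  ci:
    adapter: postgresql
    database: gitlabhq_production  # same physical database for now
    # ci only sees the gitlab_ci schema
    schema_search_path: "gitlab_ci"
```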
Expanding on goal (4) - disallow cross-joins: A connection to `database:ci` will only "see" what's in the `gitlab_ci` schema. An attempted `SELECT * FROM ci_instance_variables JOIN other_table_in_public` will error out hard, because the joined table is not visible.
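Concretely, this is what that failure looks like in a Postgres session (`other_table_in_public` is a placeholder as above, and the join condition is hypothetical):

```sql
-- Simulate the database:ci connection's restricted view
SET search_path TO gitlab_ci;

-- ci_instance_variables resolves (it lives in gitlab_ci), but the
-- unqualified reference to a table left in public cannot resolve:
SELECT *
FROM ci_instance_variables
JOIN other_table_in_public USING (id);
-- ERROR:  relation "other_table_in_public" does not exist
```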
At a later stage, when we actually want the `database:ci` database schema to diverge, we introduce separate schema management for `ci`. When we have that, we can simply drop the `public` schema (with everything in it). Given the implications for self-managed below, we may not want to do this step (we don't need it - there's no harm in having a few empty tables around).
## How to iterate on extracting a group of tables, e.g. `ci_*`

In this summary, we aim to extract CI-related tables and change the application to remove cross-joins to unrelated tables. The goal is to be able to extract this set of tables into an entirely different database.
Steps:
1. Create a schema inside `db:main` for the group, e.g. `gitlab_ci`
2. Move relevant tables into this schema - transparent for the application (append to `search_path`)
3. Identify and remove cross-joins to adapt the application
4. Once compatible, use a `db:ci` connection with a `search_path` limited to `gitlab_ci` to have an isolated view of the database for CI
5. Once this is proven to work, tables in `gitlab_ci` can actually be taken out of `db:main` and moved into `db:ci` (if desired, e.g. on GitLab.com). No application changes are needed apart from changing the database connection config (different database).
All this can be done and shipped today, to both GitLab.com and self-managed. There is no need for infrastructure or Omnibus changes to support this.
## Discussion
Benefits:
- All goals achieved ✅ - a soft separation of tables that is easy to deal with
- We can start very quickly: start with step (1) of the solution above and you'll already have `ci_instance_variables` available to you in `database:ci` (albeit not isolated - this can be done in later steps)
- Ability to selectively validate isolation changes, avoiding big-bang changes. See #334690 (comment 614694876).
- No dependency on or support needed from infrastructure and Omnibus changes to start isolating application logic and tables (can all be deployed to GitLab.com/self-managed today)
- No change at all to how we manage the database schema (includes post-deploy/background migrations)
- Transparent to anybody else outside the "sharding" effort
- Self-managed: can ship to self-managed just with regular migrations - no special treatment needed
- Self-managed: huge plus (see below) - no need for separate databases unless desired
- Logical backups (`pg_dump`) continue to produce a consistent dump if there is only one database
- We don't necessarily need #334227 (closed) to implement schema comparison checking - the schema will be based on the same migrations and hence be equivalent by design.
Weaknesses:
- We know how to work with different Postgres schemas (we already do), but not all our custom migration helpers support this completely (for group::database to check and make sure this works). No urgency here, since we don't expect to do a lot of migrations on `ci_instance_variables` soon.
- If we wanted to have a different schema "on the other side" (in `database:ci`) to e.g. incorporate partitioning changes before we migrate data, this would need some work. It can be dealt with as needed when we design the data migration parts later (it's not needed today).
## Self-managed vs. GitLab.com
This is a huge plus for self-managed: installations will only need one database, and they don't need to migrate data to a separate database unless that is needed (for scaling purposes). A standard self-managed install keeps having only one database, but with a few schemas: `public, gitlab_ci, gitlab_AAA, gitlab_BBB, ...`. This can even result in using the same connection pool for the different roles (configuring the `search_path` so that all tables are visible, or managing it upon "switching" connections) instead of having one connection pool per role (which multiplies connections), reducing resource needs.
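A sketch of the "switching" variant, assuming the connection layer adjusts the search path whenever a pooled connection changes roles:

```sql
-- Connection checked out for the "ci" role: narrow the view
SET search_path TO gitlab_ci;
-- ... run CI queries against an isolated view ...

-- Same connection reused for the "main" role: widen the view again
SET search_path TO public, gitlab_ci;
-- ... all tables are visible once more ...
```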
Another benefit is that backups (pg_dump) continue to work as-is for installations that keep having one database only. With multiple databases, taking a consistent snapshot is more involved.
For GitLab.com (or another large install), we'll break the database apart and extract each Postgres schema into its own database. At this point, we already know that the relevant parts of the app (e.g. CI) work fine when they only "see" their own schema, e.g. only CI tables. So this step is a matter of designing the data migration only (we don't have to deal with app changes anymore).
This means we can also validate "isolation changes" (making sure that the app works when it doesn't see all tables anymore, but only specific ones) quickly and iteratively - we can ship it to GitLab.com without having to have a separate database already (with all the mechanics this needs) and it easily ships to self-managed, too.