Decide approach for generic multi-database features
The decomposition effort will require existing code to work with multiple physical databases. For most application logic, this means separating concerns between them, for example breaking joins or transactions that span `main` and `ci` data.
The database group supports a number of features which don't fit this pattern. These are generic database maintenance or migration tasks that must be capable of connecting to and running across all databases.
Database features
At a high level we can break these down into a few features and define them:
Migration / Migration helpers
Custom migration code we use to support safe database changes, no-downtime deployments, and more advanced features not provided by Rails. There are also helpers here to support some of the following features, like background migrations and partitioning.
Execution contexts: migrations
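To make the helper pattern concrete, here is a minimal, hypothetical sketch (not the actual implementation) of a "safe change" helper: it guards an index creation so re-running the migration is a no-op. `FakeConnection` and the helper name are stand-ins for illustration only:

```ruby
# Hypothetical sketch of a "safe" migration helper. FakeConnection is a
# stand-in for a real ActiveRecord connection; it just records indexes.
class FakeConnection
  attr_reader :indexes

  def initialize
    @indexes = []
  end

  def index_exists?(table, name)
    @indexes.include?([table, name])
  end

  def add_index(table, name)
    @indexes << [table, name]
  end
end

module SafeMigrationHelpers
  # Only create the index when it is not already present, so the
  # migration can be safely retried after a partial failure.
  def add_concurrent_index(connection, table, name)
    return if connection.index_exists?(table, name)

    connection.add_index(table, name)
  end
end

conn = FakeConnection.new
helper = Object.new.extend(SafeMigrationHelpers)
helper.add_concurrent_index(conn, :users, :index_users_on_email)
helper.add_concurrent_index(conn, :users, :index_users_on_email) # safe to re-run
```

The real helpers add concerns like statement timeouts and lock retries on top of this idempotency check, but the shape is the same.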
Background migrations
Code which executes the background migration jobs. This includes both the familiar background migrations and the newer batched background migrations. Migration helpers support this as well, and for the most part those could be included in the category above. However, some migration helpers force background migration jobs to execute inline within the migration itself, so we have to be aware of that.
Execution context: migrations, sidekiq
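The batching mechanics can be sketched in isolation. This is a hypothetical, simplified runner (class and method names are illustrative, not the real API): it splits an id range into fixed-size sub-ranges, each of which would be handed to a job:

```ruby
# Hypothetical sketch of how a batched background migration walks a
# table: the runner divides the id range into fixed-size batches and
# yields each sub-range, which a job would then process.
class BatchedMigrationRunner
  def initialize(batch_size:)
    @batch_size = batch_size
  end

  # Yields [start_id, end_id] pairs covering min_id..max_id inclusive.
  def each_batch(min_id, max_id)
    start_id = min_id
    while start_id <= max_id
      end_id = [start_id + @batch_size - 1, max_id].min
      yield start_id, end_id
      start_id = end_id + 1
    end
  end
end

runner = BatchedMigrationRunner.new(batch_size: 100)
batches = []
runner.each_batch(1, 250) { |s, e| batches << [s, e] }
batches # => [[1, 100], [101, 200], [201, 250]]
```

Whether a sub-range is processed by Sidekiq or forced inline by a migration helper, the state of each batch has to be tracked in a table somewhere, which is exactly the state/location question discussed below.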
Partitioning
Processes which automatically manage creation and cleanup of dynamic partitions. Currently the only dynamic partitioning strategy is time-based (monthly). There are two main components: the first creates new partitions ahead of their use and detaches old partitions that are no longer needed; the second drops partitions that have been detached and are ready for removal.
Execution contexts: sidekiq, initializer, rake
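The "create ahead of use" component reduces to date arithmetic. Here is a hedged sketch of that piece; the `<table>_<YYYYMM>` naming pattern and the function name are assumptions for illustration, not the real implementation:

```ruby
require 'date'

# Hypothetical sketch of the first partitioning component: given "today",
# compute the monthly partitions that should exist, including a number
# of future months created ahead of their use. Partition names follow an
# assumed <table>_<YYYYMM> pattern.
def expected_partitions(table, today, months_ahead:)
  first_of_month = Date.new(today.year, today.month, 1)
  (0..months_ahead).map do |offset|
    month = first_of_month >> offset # advance by whole months
    format('%s_%04d%02d', table, month.year, month.month)
  end
end

expected_partitions('audit_events', Date.new(2022, 1, 15), months_ahead: 2)
# => ["audit_events_202201", "audit_events_202202", "audit_events_202203"]
```

The manager would diff this expected list against the partitions that actually exist (read from the PG catalog) to decide what to create or detach, which is why the catalog-backed models come up again below.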
Reindexing
Process which periodically rebuilds indexes to reduce bloat.
Execution contexts: rake (from system cron)
Multi-database execution
In addition to needing multi-database execution, all of the above have their own supporting models used to track or query state. Some of these are shared between contexts; for example, the background migration tables are referenced both by the background migration runner and individual jobs, and also by migrations.
In some instances the models are backed by views over PG catalog tables, so we can easily query information about partitions, index bloat, etc.
There are two main concerns to discuss, which are related but can be treated as separate issues:
- On which database do we store state for multi-database features, and
- How do we run these features in a context of multiple databases
Supporting tables/models
I think we have two options here:
- Create identical tables in each database, keeping each close to the data it operates on. For some models (like those backed by the PG catalog), we don't have any other way to do it.
- Keep the tables on the main database only, with added information to understand where the operations need to occur.
The approach so far has been to duplicate those tables on each database. As mentioned above, this is a requirement for some models to see the correct data. For others, like background migrations, it's not as clear cut. The `schema_migrations` table will have to exist on each database, so we can see background migration information as an extension of that. This could also make transactional work easier, because a background migration executing jobs against CI does not have to reach across to the main database to update tracking tables.
The downside to this approach is ensuring we update the correct data in the correct location: we have to be very careful about connection management.
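One way to make that care explicit is to force every operation to name the database it targets, resolved through a registry. This is a hypothetical sketch of the pattern, using plain structs in place of real connections:

```ruby
# Hypothetical sketch of explicit connection management when tables are
# duplicated per database: every operation names its target database,
# and a registry resolves that name to a connection. FakeDb stands in
# for a real connection and just records executed statements.
FakeDb = Struct.new(:name, :statements) do
  def execute(sql)
    statements << sql
  end
end

class ConnectionRegistry
  def initialize(connections)
    @connections = connections
  end

  # Yields the connection for database_name; fails loudly on a typo
  # instead of silently writing to the wrong database.
  def with(database_name)
    conn = @connections.fetch(database_name) do
      raise ArgumentError, "unknown database: #{database_name}"
    end
    yield conn
  end
end

registry = ConnectionRegistry.new(
  main: FakeDb.new(:main, []),
  ci: FakeDb.new(:ci, [])
)

# Tracking updates for CI jobs go to the ci copy of the tables.
registry.with(:ci) { |conn| conn.execute('UPDATE batched_jobs SET status = 1') }
```

Failing loudly on an unknown name is the point of the sketch: the risk with duplicated tables is not that a query errors, but that it quietly succeeds against the wrong database.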
Running features in a multi-database setup
Likewise I think we have two options here (although not all features have to follow the same approach):
- Keep the database in-sync, and run all features across all databases until we physically separate them.
- Run only what we need where we need it
Considering option 1, we already keep the schema in sync between main and ci and share a single `structure.sql`, so this follows that philosophy. If we plan to run "regular" (schema-changing) migrations against both databases, it would be consistent to also create partitions on both. Also, if we run migrations that enqueue background migrations against both databases, but don't actually run the jobs on both, that has an impact later when we finish or clean up those migrations.
Considering option 2, this seems to fit better with the end goal of what we want to accomplish. With a split database in production, we don't want to run the same operations against all the databases. Prior to the physical split, we have replication to maintain the data between the two clusters. After the physical split, we begin to run only what we need where we want it, and let the unused tables degrade until we truncate/remove them.
Technical challenges
We also have the final piece: how do we implement all of this? We have already taken two approaches for partitioning, neither of which works without a fair bit of complexity. Ultimately, in the case of using the PG catalog, we have models which need to be shared across both databases, so we need to solve connection management.
For migration helpers, this doesn't matter as much currently because Rails overrides connections there, giving us what we need as the default. For sidekiq/application code, we need to be explicit about passing the correct connection throughout to get consistent results, especially within a transaction scope.
In background migrations, we use migration-local models or anonymous models to do database work, which typically inherit from `ActiveRecord::Base`. We need to override connection defaults here to have the models use the correct connection for the context of execution.
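The override pattern itself can be sketched without ActiveRecord. Here `BaseModel` is a hypothetical stand-in that mimics only the relevant behavior: a class-level `connection` that subclasses inherit unless pinned to something else for their execution context:

```ruby
# Hypothetical sketch of overriding a model's connection per execution
# context. BaseModel mimics the relevant slice of ActiveRecord: a
# class-level connection that subclasses inherit by default.
class BaseModel
  class << self
    attr_writer :connection

    def connection
      return @connection if @connection

      superclass.connection if superclass.respond_to?(:connection)
    end
  end
end

# The application default, e.g. the main database.
BaseModel.connection = :main_connection

# An anonymous, migration-local model, pinned to the ci connection for
# the duration of this context. The default elsewhere is unaffected.
ci_model = Class.new(BaseModel)
ci_model.connection = :ci_connection

ci_model.connection  # => :ci_connection
BaseModel.connection # => :main_connection
```

This mirrors what we need in background migrations: an anonymous model that normally falls through to the inherited default, but can be explicitly bound to the connection for the database the job is operating on.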