New docs page: 'Sync Strategies' / 'Methods of Replication'
In conversations regarding the separation of concerns between the Singer Spec and Meltano, the concept of 'sync strategies' has recurred, and is therefore worth documenting at this stage!

From this viewpoint, Singer is responsible for abstracting away the specifics of interrogating sources/destinations and the mechanics of moving data between them. Deciding what to move, and how to optimise that transfer (as a sequence of executions of individual pipelines), necessarily happens one abstraction layer above, in the orchestrator's context. Some things therefore should (and will) never make it into the Singer Spec, and we will need to invent some other construct for representing these sequences of executions, referred to here as 'sync strategies'.
The examples below are inspired mainly by database > data warehouse integrations, but have equivalents in other source > destination combinations. This list is therefore not intended to be exhaustive at this stage; if you have experience with other source/destination types, please describe the strategies you followed.
## Backfill

### Basic Backfill

- A single task runs on selected streams using the `FULL_TABLE` replication mode.
- A schedule is created after completion of the one-off `FULL_TABLE` backfill, in either `INCREMENTAL` or `LOG_BASED` replication mode.

Note: for `LOG_BASED` replication, most database engines (MySQL, Postgres etc.) require setup to enable bin-log decoding before initiating the `FULL_TABLE` sync, to ensure changes made during the backfill are captured. As a result, some duplication may occur for changes made between the time logical decoding is enabled and the time the backfill completes.
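The two-step sequence above can be sketched as an ordered task list. This is a hypothetical model for illustration only; `SyncTask` and `basic_backfill` are not Meltano or Singer APIs:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical task model for illustration; not a Meltano or Singer API.
@dataclass
class SyncTask:
    streams: List[str]
    replication_mode: str
    scheduled: bool = False  # True => recurring schedule, False => one-off task

def basic_backfill(streams: List[str], follow_up_mode: str = "INCREMENTAL"):
    """A basic backfill is two ordered steps: a one-off FULL_TABLE task over
    the selected streams, then a recurring schedule in INCREMENTAL or
    LOG_BASED mode, created only after the backfill completes."""
    return [
        SyncTask(streams, "FULL_TABLE"),
        SyncTask(streams, follow_up_mode, scheduled=True),
    ]

plan = basic_backfill(["public-users", "public-orders"], follow_up_mode="LOG_BASED")
```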
### Basic Backfill+

- A single task runs on selected streams using the `FULL_TABLE` replication mode with an `end_date` configured.
- A schedule is created in either `INCREMENTAL` or `LOG_BASED` replication mode, with a `start_date` configured to match the backfill `end_date`, and runs simultaneously with the one-off backfill.

The intention here is that new changes arrive at the same time as historical data is backfilled. For `INCREMENTAL` streams, the `end_date`/`start_date` boundary can be some arbitrary time in the recent past (say, the beginning of the month/quarter/year). For `LOG_BASED`, the limiting factor is the availability of decoded log files.
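The boundary alignment can be sketched as follows. This is a sketch, assuming the start of the current month as the arbitrary boundary; the dict layout is illustrative, not Meltano's actual config schema:

```python
from datetime import datetime, timezone

def backfill_plus(now=None):
    """Align the backfill's end_date with the schedule's start_date so the
    two can run simultaneously without a gap or an overlap. The boundary
    here is the start of the current month, but any recent point works."""
    now = now or datetime.now(timezone.utc)
    boundary = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    backfill = {"replication_mode": "FULL_TABLE", "end_date": boundary}
    schedule = {"replication_mode": "INCREMENTAL", "start_date": boundary}
    return backfill, schedule

backfill, schedule = backfill_plus()
```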
### Intelligent Backfill

- A discovery step assesses selected streams (tables) to determine the most effective backfill strategy. 'Small' tables are replicated using a one-off `FULL_TABLE` task. 'Large' tables are split into multiple tasks, each replicating an equal-sized chunk of historical data (partitioned by `start_date`/`end_date` or by range of sequential IDs), in order of newest chunk to oldest (e.g. `[2020, 2019, 2018, 2017, ...]`).
- A schedule is created in either `INCREMENTAL` or `LOG_BASED` replication mode, with a `start_date` configured to match the backfill `end_date`, and runs simultaneously with the one-off backfill.

The intention here is that not only do new changes arrive whilst historical data is being backfilled, but historical data also arrives in parallelised chunks ordered from newest to oldest (rather than the default oldest-to-newest ordering of recovery-compatible `FULL_TABLE` syncs).
## Parallel Sync

### Basic Parallelism

- Streams are synced with one stream per task.

### Rule-based Parallelism

- Parallelism is determined according to selection criteria provided by the user, with filter fields that include stream name and replication mode (e.g. all `FULL_TABLE` syncs grouped into one job, all tables matching `customer*` in individual jobs, etc.).
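One way to sketch rule-based grouping is glob-style patterns over stream names. The rule format here is invented for illustration and is not an existing Meltano configuration:

```python
import fnmatch

def group_streams(streams, rules):
    """Assign each stream to a job by its first matching rule.

    streams: list of (name, replication_mode) tuples.
    rules: dicts with a glob 'pattern', an optional 'mode' filter, and a
    'group' key; group=None means one job per matching stream.
    """
    jobs = {}
    for name, mode in streams:
        for rule in rules:
            matches_name = fnmatch.fnmatch(name, rule.get("pattern", "*"))
            matches_mode = rule.get("mode", mode) == mode
            if matches_name and matches_mode:
                jobs.setdefault(rule["group"] or name, []).append(name)
                break  # first matching rule wins
    return jobs

jobs = group_streams(
    [("orders", "FULL_TABLE"), ("customers", "INCREMENTAL"), ("customer_notes", "INCREMENTAL")],
    [
        {"pattern": "*", "mode": "FULL_TABLE", "group": "full-table-batch"},
        {"pattern": "customer*", "group": None},  # individual jobs
    ],
)
```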
### Melturbo

- Automatic determination of the parallelisation strategy, based on an assessment that includes the number of records in the source, the average number of records transferred per job (over some number of runs?), and the availability of primary and incremental keys (to determine replication mode). This assessment is repeated periodically by Meltano so that the strategy remains performant as data volumes change over time.
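The assessment above could be sketched as two steps: choosing a replication mode per stream, then bucketing streams into tasks by observed volume. All field names and thresholds below are invented for illustration; nothing here is an existing Meltano feature:

```python
def choose_replication(stream):
    """Prefer LOG_BASED where log decoding is available, then INCREMENTAL
    where a replication (incremental) key exists, else fall back to
    FULL_TABLE."""
    if stream.get("log_based_available"):
        return "LOG_BASED"
    if stream.get("replication_key"):
        return "INCREMENTAL"
    return "FULL_TABLE"

def plan_parallelism(streams, records_per_task=1_000_000):
    """Bucket streams so each task moves roughly records_per_task rows,
    giving the biggest streams dedicated tasks. In practice this assessment
    would be re-run periodically as data volumes change over time."""
    tasks, current, size = [], [], 0
    for stream in sorted(streams, key=lambda s: s["row_count"], reverse=True):
        if stream["row_count"] >= records_per_task:
            tasks.append([stream["name"]])  # big stream: its own task
            continue
        current.append(stream["name"])
        size += stream["row_count"]
        if size >= records_per_task:
            tasks.append(current)
            current, size = [], 0
    if current:
        tasks.append(current)
    return tasks

tasks = plan_parallelism([
    {"name": "events", "row_count": 2_000_000},
    {"name": "orders", "row_count": 600_000},
    {"name": "users", "row_count": 500_000},
    {"name": "plans", "row_count": 100_000},
])
```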