Documentation proposal: how to perform "non-destructive" data loads
Background
For the purpose of this ticket, a "non-destructive" data load is a load which does not permanently delete or replace (i.e., does not "destroy") existing data when loading new data. As summarized in this Slack message:
> Most taps are upsert only; either FULL_TABLE (all records) or INCREMENTAL (only records on/after the bookmark).
However, a common requirement for data engineering pipelines is to retain all prior versions of records, which allows full capability to build slowly changing dimensions with the raw output data.
Some Possible Solutions
There may be other options I'm not aware of but there are a few basic methods I've identified:
- If the stream supports incremental extracts: run in `incremental` mode with forced `null` `key_properties` (primary key).
  - By nulling out the primary key indicator, we prevent the target from knowing how to run a merge upsert, essentially forcing `append` behavior instead of `upsert`.
- If the stream does not support incremental extracts: run in `full_table` mode with forced `null` `key_properties` (primary key).
  - As above, we prevent the upsert behavior by nulling out the primary key.
  - Caution: it is up to the particular target implementation to determine whether the destination table is fully replaced (destructive) or appended to (non-destructive). I don't believe there's a consistent spec for this behavior, so the user must test/evaluate the target to ensure "append" behavior is performed and not "replace-all".
  - Without a replication key, the amount of duplicated data will be very significant. Some streams can tolerate this amount of duplication, but in an ideal scenario the target would have some mode for inserting only those records which have changed.
- Regardless of extract mode, the target must support adding `_sdc_*` metadata columns such as the batching timestamp. These timestamps are critical to allow the downstream transformation system to identify which row versions are newest and which row versions are "replaced" by newer versions.
  - In SQL, this is commonly expressed as one or both of the following in a downstream system such as dbt:
    - `ROW_NUMBER() OVER (PARTITION BY <id_cols> ORDER BY _sdc_batched_at DESC) AS recency_rank` (1 = latest)
    - `ROW_NUMBER() OVER (PARTITION BY <id_cols> ORDER BY _sdc_batched_at ASC) AS record_version_num` (1 = first)
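To illustrate why nulling out `key_properties` forces append behavior, here is a simplified, hypothetical sketch of the decision a typical target makes at load time. The function and variable names are illustrative only and are not taken from any real target implementation:

```python
# Hypothetical sketch: how a Singer target typically decides between
# merge-upsert and append. Not any real target's code.

def load_batch(records, key_properties, existing):
    """Merge-upsert when key_properties is set; append when it is null/empty."""
    if not key_properties:
        # No primary key declared: the target cannot match incoming rows to
        # existing rows, so every record is appended as a new row version.
        existing.extend(records)
        return existing
    # Primary key declared: replace any existing row with the same key.
    index = {tuple(r[k] for k in key_properties): i
             for i, r in enumerate(existing)}
    for rec in records:
        key = tuple(rec[k] for k in key_properties)
        if key in index:
            existing[index[key]] = rec  # destructive: old version is lost
        else:
            index[key] = len(existing)
            existing.append(rec)
    return existing

# With a key, the second load overwrites the first version:
rows_with_key = load_batch([{"id": 1, "v": "a"}], ["id"], [])
rows_with_key = load_batch([{"id": 1, "v": "b"}], ["id"], rows_with_key)
print(len(rows_with_key))  # 1 row: prior version destroyed

# With key_properties nulled out, both versions are retained:
rows_no_key = load_batch([{"id": 1, "v": "a"}], None, [])
rows_no_key = load_batch([{"id": 1, "v": "b"}], None, rows_no_key)
print(len(rows_no_key))  # 2 rows: non-destructive
```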
There may be other methods I'm not aware of, but the above hopefully describes the basic landscape of the problem we would be trying to solve.
Table stakes for a user implementing this approach is a decision that deduping will be performed in the transformation layer, based on the `_sdc_batched_at` timestamp column (or a similar method).
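As a minimal sketch of that downstream dedup step, the following uses SQLite (3.25+ for window functions) to stand in for the warehouse. The table and column names other than `_sdc_batched_at` are illustrative assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE raw_users (id INTEGER, name TEXT, _sdc_batched_at TEXT)"
)
con.executemany(
    "INSERT INTO raw_users VALUES (?, ?, ?)",
    [
        (1, "old name", "2023-01-01T00:00:00"),
        (1, "new name", "2023-02-01T00:00:00"),  # newer version of id=1
        (2, "only version", "2023-01-15T00:00:00"),
    ],
)

# recency_rank = 1 marks the latest version of each record; filtering on it
# dedupes in the transformation layer without destroying older row versions.
latest = con.execute("""
    SELECT id, name FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY id ORDER BY _sdc_batched_at DESC
               ) AS recency_rank
        FROM raw_users
    ) WHERE recency_rank = 1
    ORDER BY id
""").fetchall()
print(latest)  # [(1, 'new name'), (2, 'only version')]
```

In a dbt project this would typically live in a staging model rather than ad-hoc Python, but the SQL is the same shape.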
Objective
In opening this ticket, I'm hoping to start a conversation regarding best practices for "non-destructive" loads. Based on the outcome of our analysis here, I hope that we can create documentation (or link to existing docs, if they exist) that makes this more straightforward for users of the Meltano and Singer ecosystems.