Add date-window compatibility
Context (from meltano#122 (closed))
Making singer-taps date-window compatible Currently the Singer taps work using a start date, and integrate data from there.
They have no support for backfilling data, as the next run should always start where the last one finished.
| RUN 1 | RUN 2 | RUN 3 | ... | RUN X
------------------------------------------------> (time)
| (this is now)
When you want to backfill, you can't tell a tap to stop when it reached the data boundary (where there is data).
I think we want to change that to support date windows instead, by being able to stop a tap after it reached a certain timespan (either absolute or relative). Then you think of your job as chunk of data, from [start, end].
Doing this will simplify a lot the orchestration between the job, because no state has to be tracked anymore by the tap.
| RUN X | RUN A | RUN X |
------------------------------------------------> (time)
| (this is now)
Here RUN A will always cover the same chunk of data.
This is required if we want taps to work with Airflow.
Open questions:
- How do we deal with APIs that use the updated_at to filter results? It would then make RUN A not idempotent anymore (new data has been updated after the start_date)
- How do we deal with data integration when a job fetch an entity in the past? I think it should not update it unless the update_at is greater than the existing row.
Solution 1
Add an end_date parameter to the tap and let it handle the business logic for handling this date.
Pros
- Simple, if the end_date is not provided, default to
now. - Backward compatible: if
end_dateis provided but not handled, the tap will continue working as usual.
Cons
- One size fits all approach, there is no granularity in the handling of the window, all entities are expected to be treated such as:
\begin{aligned}
& \text{Let } W = [start\_date, end\_date) \text{ // the date window} \\
& \text{Let } E_n = \text{any instance of entity type } n \\
& \forall E_n, E_n.date \in W
\end{aligned}
config.js
{
…
start_date: "2018-01-01",
end_date: "2018-02-01"
}
Solution 2
Add a intervals (name TBD) parameter that define the window, and how the tap should interpret it.
Alternative names would be
chunks,partition,segments.
config.js
{
…
intervals: [
// equivalent to [0, 5000[
{ type: 'offset', on: "a_certain_stream", from: 0, offset: 5000 },
// equivalent to [2018-01-01, 2018-02-01[
{ type: 'date-window', on: "stream_b", from: "2018-01-01", to: "2018-02-01" },
// equivalent to [now - 1 month, now[
{ type: 'date-offset', on: "last_stream", offset: "-1 month" }
// equivalent to [2018-01-01, 2018-03-01[
{ type: 'date-offset', on: "last_stream", from: "2018-01-01", offset: "2 month" }
]
}
Each tap would be responsible for validating the specified intervals can be consumed by the stream it targets.
The --intervals option could also be provided to use intervals for a single run.
Pros
- Fine grained control over each stream
- Validation that the output stream can/can't support the window.
- Isolated from the current Singer spec (
start_date)
Cons
- More complex
- Need to implement helpers in
singer-pythonto help cope with these new constructs - Bigger change for Singer