Add date-window compatibility

Context (from meltano#122 (closed))

Making singer-taps date-window compatible

Currently, Singer taps work from a start date and integrate data from that point onward.

They have no support for backfilling data, as the next run should always start where the last one finished.

  | RUN 1   |  RUN 2  |  RUN 3  | ... |  RUN X  
------------------------------------------------> (time)
                                      | (this is now)

When you want to backfill, you cannot tell a tap to stop once it reaches the data boundary (the point where data already exists).

I think we should change that and support date windows instead, by making it possible to stop a tap once it has covered a given timespan (either absolute or relative). You can then think of each job as a chunk of data covering [start, end].

This would greatly simplify the orchestration between jobs, because the tap no longer has to track any state.

|  RUN X  |   RUN A   |  RUN X  |
------------------------------------------------> (time)
                                      | (this is now)

Here RUN A will always cover the same chunk of data.
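As a rough illustration of treating each run as a fixed chunk, the sketch below splits a time range into consecutive half-open windows. The helper name `splitIntoWindows` and the fixed day-based step are assumptions made for this example; they are not part of Singer or Meltano.

```javascript
// Hypothetical helper (not part of Singer or Meltano): split [start, end)
// into consecutive half-open windows of `days` days each.
const DAY_MS = 24 * 60 * 60 * 1000;

function splitIntoWindows(start, end, days) {
  if (!(days > 0)) throw new Error("days must be positive");
  const windows = [];
  const stop = Date.parse(end); // date-only ISO strings are parsed as UTC
  let from = Date.parse(start);
  while (from < stop) {
    // The last window is clipped so it never extends past `end`.
    const to = Math.min(from + days * DAY_MS, stop);
    windows.push({
      start_date: new Date(from).toISOString().slice(0, 10),
      end_date: new Date(to).toISOString().slice(0, 10),
    });
    from = to;
  }
  return windows;
}

const windows = splitIntoWindows("2018-01-01", "2018-01-22", 7);
console.log(windows);
```

Because each window depends only on its inputs, re-running the job for a given window always targets the same chunk, which is the property RUN A relies on.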

This is required if we want taps to work with Airflow.

Open questions:

  • How do we deal with APIs that filter results on updated_at? RUN A would then no longer be idempotent (data may have been updated after the start_date).
  • How do we deal with data integration when a job fetches an entity from the past? I think it should not update the row unless the entity's updated_at is greater than that of the existing row.
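One possible answer to the second question is sketched below: the loading side only overwrites a row when the incoming record is strictly newer. The `upsertIfNewer` function, the in-memory `Map` standing in for the destination table, and the `id` key are all assumptions made for illustration.

```javascript
// Hypothetical loader-side merge. `table` is a Map keyed by record id,
// standing in for the destination table.
function upsertIfNewer(table, record) {
  const existing = table.get(record.id);
  if (
    existing &&
    Date.parse(existing.updated_at) >= Date.parse(record.updated_at)
  ) {
    return false; // existing row is as new or newer: keep it
  }
  table.set(record.id, record);
  return true; // row inserted or replaced
}

const table = new Map();
upsertIfNewer(table, { id: 1, name: "new", updated_at: "2018-02-10T00:00:00Z" });
// A backfill job later fetches an older version of the same entity.
const applied = upsertIfNewer(table, { id: 1, name: "old", updated_at: "2018-01-05T00:00:00Z" });
console.log(applied, table.get(1).name);
```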

Solution 1

Add an end_date parameter to the tap and let the tap implement the business logic for handling this date.

Pros

  • Simple: if end_date is not provided, it defaults to now.
  • Backward compatible: if end_date is provided but not handled, the tap will continue working as usual.

Cons

  • One-size-fits-all approach: there is no granularity in how the window is handled, and all entities are expected to satisfy:
\begin{aligned}
& \text{Let } W = [start\_date, end\_date) \text{ // the date window} \\
& \text{Let } E_n = \text{any instance of entity type } n \\
& \forall  E_n, E_n.date \in W
\end{aligned}

config.js

{
  
  start_date: "2018-01-01",
  end_date: "2018-02-01"
}
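A minimal sketch of how a tap might apply this config, assuming the half-open window [start_date, end_date) defined above and the default-to-now behavior listed under Pros. The `inWindow` function and its `now` parameter are hypothetical and are not part of the Singer spec.

```javascript
// Hypothetical check a tap could run per record. Implements the
// half-open window [start_date, end_date); end_date defaults to now.
function inWindow(recordDate, config, now = new Date()) {
  const start = Date.parse(config.start_date);
  const end = config.end_date ? Date.parse(config.end_date) : now.getTime();
  const t = Date.parse(recordDate);
  return t >= start && t < end;
}

const config = { start_date: "2018-01-01", end_date: "2018-02-01" };
console.log(inWindow("2018-01-15", config)); // inside the window
console.log(inWindow("2018-02-01", config)); // end_date itself is excluded
```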

Solution 2

Add an intervals (name TBD) parameter that defines the window and how the tap should interpret it.

Alternative names would be chunks, partitions, or segments.

config.js

{
  
  intervals: [
    // equivalent to [0, 5000[
    { type: 'offset', on: "a_certain_stream", from: 0, offset: 5000 },

    // equivalent to [2018-01-01, 2018-02-01[
    { type: 'date-window', on: "stream_b", from: "2018-01-01", to: "2018-02-01" },

    // equivalent to [now - 1 month, now[
    { type: 'date-offset', on: "last_stream", offset: "-1 month" },

    // equivalent to [2018-01-01, 2018-03-01[
    { type: 'date-offset', on: "last_stream", from: "2018-01-01", offset: "2 month" }
  ]
}

Each tap would be responsible for validating that the specified intervals can be consumed by the stream they target.

The --intervals option could also be provided to use intervals for a single run.
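To make the interval semantics concrete, here is a sketch of how the three interval types from the config above could be resolved into explicit half-open ranges. `resolveInterval`, `addMonths`, and `isoDate` are hypothetical names, and the sketch assumes that offsets are only ever expressed in months, as in the examples; a real helper in singer-python might look quite different.

```javascript
// Hypothetical resolver: turns one interval definition into an explicit
// half-open range { on, from, to }. Only month-based offsets are handled.
function isoDate(d) {
  return d.toISOString().slice(0, 10);
}

function addMonths(date, n) {
  const d = new Date(date.getTime());
  // Note: day-of-month overflow rolls into the next month (e.g. Jan 31 + 1 month).
  d.setUTCMonth(d.getUTCMonth() + n);
  return d;
}

function resolveInterval(interval, now = new Date()) {
  switch (interval.type) {
    case "offset":
      return { on: interval.on, from: interval.from, to: interval.from + interval.offset };
    case "date-window":
      return { on: interval.on, from: interval.from, to: interval.to };
    case "date-offset": {
      const match = /^(-?\d+) months?$/.exec(interval.offset);
      if (!match) throw new Error(`unsupported offset: ${interval.offset}`);
      const months = Number(match[1]);
      // Without `from`, the offset is relative to now.
      const anchor = interval.from ? new Date(interval.from) : now;
      const shifted = addMonths(anchor, months);
      const [from, to] = months < 0 ? [shifted, anchor] : [anchor, shifted];
      return { on: interval.on, from: isoDate(from), to: isoDate(to) };
    }
    default:
      throw new Error(`unknown interval type: ${interval.type}`);
  }
}

const resolved = resolveInterval(
  { type: "date-offset", on: "last_stream", from: "2018-01-01", offset: "2 month" }
);
console.log(resolved);
```

A tap could run such a resolver first and then validate each resolved range against the stream it targets, rejecting, for example, a date range on a stream that only supports numeric offsets.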

Pros

  • Fine grained control over each stream
  • Validation that the output stream can/can't support the window.
  • Isolated from the current Singer spec (start_date)

Cons

  • More complex
  • Need to implement helpers in singer-python to help cope with these new constructs
  • Bigger change for Singer
Edited by Micaël Bergeron