Making singer-taps date-window compatible
Currently, Singer taps work from a start date and integrate all data from that point on.
They have no support for backfilling data, because each run is expected to start where the previous one finished.
| RUN 1 | RUN 2 | RUN 3 | ... | RUN X
------------------------------------------------> (time)
| (this is now)
When you want to backfill, you can't tell a tap to stop once it reaches the data boundary (the point from which data already exists).
I think we want to change that to support date windows instead, by being able to stop a tap once it has covered a certain timespan (either absolute or relative). You can then think of each job as a chunk of data, from [start, end].
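As a rough illustration of the date-window idea, here is a minimal Python sketch. Nothing in it is part of the Singer spec or of any existing tap: the `records_in_window` helper, the `updated_at` key and the record shape are all hypothetical. It also treats the window as half-open, `[start, end)`, so that adjacent windows never both include a record sitting exactly on the boundary; that is an assumption, not something decided above.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator


def records_in_window(
    records: Iterable[Dict],
    start: datetime,
    end: datetime,
    key: str = "updated_at",  # hypothetical replication key
) -> Iterator[Dict]:
    """Yield only the records whose timestamp falls in [start, end)."""
    for record in records:
        # Timestamps are assumed to be ISO 8601 strings with an explicit offset.
        timestamp = datetime.fromisoformat(record[key])
        if start <= timestamp < end:
            yield record


# Made-up sample data, only to show the call.
sample = [
    {"id": 1, "updated_at": "2019-01-01T00:00:00+00:00"},
    {"id": 2, "updated_at": "2019-01-02T00:00:00+00:00"},
    {"id": 3, "updated_at": "2019-01-03T00:00:00+00:00"},
]
window_start = datetime(2019, 1, 1, tzinfo=timezone.utc)
window_end = datetime(2019, 1, 3, tzinfo=timezone.utc)
selected = list(records_in_window(sample, window_start, window_end))
```

A real tap would more likely pass the window to the source API as query parameters instead of filtering client-side; the filter only shows what "stop at the end of the window" means for the records emitted.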
Doing this will greatly simplify orchestration between jobs, because the tap no longer has to track any state.
| RUN X | RUN A | RUN X |
------------------------------------------------> (time)
| (this is now)
Here RUN A will always cover the same chunk of data.
This is required if we want taps to work with Airflow.
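To make the "each run covers a fixed chunk" idea concrete, the following sketch splits a time range into consecutive windows that a scheduler could hand to individual runs. It is only an assumption about how the orchestration might look: `date_windows` is a hypothetical helper, not an Airflow or Singer API, and the windows are again half-open.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def date_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        # The last window is clipped so it never runs past `end`.
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


# Three daily windows; a run assigned to windows[1] always gets the same bounds.
windows = list(
    date_windows(datetime(2019, 1, 1), datetime(2019, 1, 4), timedelta(days=1))
)
```

Because the bounds depend only on the inputs and not on any state kept by the tap, re-running a given window requests the same chunk every time, which is the property a backfill needs.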
Open questions:
- How do we deal with APIs that use updated_at to filter results? It would make RUN A no longer idempotent (data may have been updated after the start_date).
- How do we deal with data integration when a job fetches an entity from the past? I think it should not update the existing row unless the fetched updated_at is greater than the existing row's.
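The rule suggested in the second question (only overwrite a row when the incoming updated_at is newer) could look roughly like this. It is a sketch under assumptions: the target is modelled as an in-memory dict keyed by a hypothetical `id` field, and `upsert_if_newer` is an illustrative name, not an existing target function.

```python
from datetime import datetime
from typing import Dict


def upsert_if_newer(
    table: Dict[int, Dict],
    record: Dict,
    key: str = "id",               # hypothetical primary key
    ts_key: str = "updated_at",    # hypothetical replication key
) -> bool:
    """Insert the record, or replace the stored row only if the record is newer.

    Returns True when the table was changed, False when the record was skipped.
    """
    existing = table.get(record[key])
    if existing is None or record[ts_key] > existing[ts_key]:
        table[record[key]] = record
        return True
    return False


# Made-up rows: a backfill delivers an older version of an entity already loaded.
rows = {1: {"id": 1, "name": "new", "updated_at": datetime(2019, 3, 1)}}
stale = {"id": 1, "name": "old", "updated_at": datetime(2019, 1, 1)}
fresh = {"id": 1, "name": "newer", "updated_at": datetime(2019, 4, 1)}

stale_applied = upsert_if_newer(rows, stale)
fresh_applied = upsert_if_newer(rows, fresh)
```

With this rule a backfill run cannot clobber a row that a later run has already refreshed, whatever order the windows are processed in.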
Edited by Yannis Roussos