All jobs run for the same Extractor share the same state
At the moment, each tap stores its state in a single place: `.meltano/run/{TAP_NAME}/state.json`

That means that all jobs run for the same Tap, even with different `job_id`s, share the same state.
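To make the collision concrete, here is a minimal sketch of the path layout. `shared_state_path` mirrors the location described above; `per_job_state_path` is a hypothetical alternative used only for illustration, not Meltano's actual implementation, and the job ids are made-up example values.

```python
from pathlib import Path


def shared_state_path(tap_name: str) -> Path:
    """Layout described in this issue: one state file per tap."""
    return Path(".meltano") / "run" / tap_name / "state.json"


def per_job_state_path(tap_name: str, job_id: str) -> Path:
    """Hypothetical layout: one state file per (tap, job_id) pair."""
    return Path(".meltano") / "run" / tap_name / job_id / "state.json"


# With the shared layout, the job_id never enters the path, so two jobs
# for the same tap resolve to the same file.
shared = shared_state_path("tap_gitlab")

# With the hypothetical per-job layout, the two jobs get distinct files.
job_a = per_job_state_path("tap_gitlab", "job_a")
job_b = per_job_state_path("tap_gitlab", "job_b")
```

Because the shared path depends only on the tap name, any job run for that tap reads and overwrites the same file.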
How to reproduce:
- Use a tap that has a `start_date` (e.g. I tested using `tap_gitlab`)
- Run an ELT with a `job_id`
- Update the configuration of the Tap and set the `start_date` 1 month in the past
- Run a new ELT with a new `job_id`

Expected Behavior: As this is a new job with no state stored for it, we would expect it to run the full extraction starting from the new `start_date`.
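The reproduction steps can be modeled with a small simulation. This is a simplified sketch, not how Meltano or any tap is actually implemented: it assumes the state is a single bookmark date and that a run resumes from the bookmark when one exists, falling back to the configured `start_date` otherwise. The `run_elt` function, the dictionary stores, the job ids, and all dates are hypothetical.

```python
from datetime import date


def run_elt(store: dict, key, configured_start: date, today: date) -> date:
    """Return the date extraction starts from, then save a new bookmark."""
    # Resume from the saved bookmark if one exists under this key,
    # otherwise fall back to the configured start_date.
    start = store.get(key, configured_start)
    store[key] = today
    return start


today = date(2019, 3, 8)  # illustrative date

# State keyed by tap name only: both jobs read and write the same entry.
shared = {}
first_shared = run_elt(shared, "tap_gitlab", date(2019, 3, 1), today)
second_shared = run_elt(shared, "tap_gitlab", date(2019, 2, 1), today)

# State keyed by (tap name, job_id): the new job finds no bookmark.
per_job = {}
first_per_job = run_elt(per_job, ("tap_gitlab", "job_a"), date(2019, 3, 1), today)
second_per_job = run_elt(per_job, ("tap_gitlab", "job_b"), date(2019, 2, 1), today)
```

In the shared model the second run starts from the bookmark left by the first run and ignores the earlier `start_date`. In the per-job model the second run starts from the new `start_date`, which is the expected behavior.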
Why is this a problem and why do we need to solve this?
One of the core premises of ELT pipelines is the ability to run a pipeline for any date interval the user wants. Even if we do not support an end date for most Taps at the moment, the user must be able to set the start date according to her needs.
Why?
- Users can test Meltano with a 1-week extraction and check if it works. They will then want to go back and fetch all their data. This is not possible right now.
- The core Data Engineering problem: A user realizes that 2 days of data are missing (due to a scheduled job that failed, or because she ran the ELT manually and made a mistake). She has to be able to re-fetch everything for those 2 days (or everything from that point up to today, if the tap does not provide an end date).