STATE support for incremental runs
As stated in the README, it'd be awesome if tap-google-analytics emitted STATE messages for "true" incremental support, instead of the date range based chunking it does now. This would allow for more efficient runs for big bulk imports right at the start, and it would remove the responsibility for some outside system to know what ranges have or haven't been extracted by using the more robust mechanism built right into Singer. I'm interested in contributing support for this but would love to agree on a direction before doing a bunch of work you fine folks think is wrong.
From the README:
The difficulty on that front is on dynamically deciding which attributes to use for capturing state for ad-hoc reports that do not include the ga:date dimension or other combinations of Time Dimensions.
Difficult indeed! First off, I think it's good to mention that for reports that do include the ga:date dimension, we can use the value of that as the STATE for that stream, right? I think that's the right idea but I see a couple potential pitfalls:
- we'd have to be careful to emit the max date seen as part of the stream as the STATE, and only once we're sure all records for that date have been emitted as well. If an API call returned unsorted rows, or rows sorted by something other than the date, and we started emitting STATE messages with whatever the most recent date seen was, the STATE could potentially cover more dates than have actually been emitted so far. I'm not sure if that ever happens, but I think the safest thing to do would be to wait until the end of a stream's processing to emit any STATE messages at all. This is different than other HTTP API taps I've seen, but tap-google-analytics already works this way where it fetches all the records before emitting any, so I don't think it really matters.
- we'd have to be careful about emitting today's date, as today is not over yet and thus the data for it will change. I think that because the end_date config option defaults to yesterday this would not be an issue by default, but if someone passed in an end_date of today, we'd extract and emit the records for today before today was over, and they'd need to be replaced next run. To solve this I'd suggest we either emit a STATE record with a maximum of yesterday's date, or forbid end_date from being today.
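To make the two pitfalls above concrete, here's a rough sketch of what I have in mind. It is not the tap's current code. The row shape (a dict with a "ga:date" key in YYYYMMDD form), the function names, and the "bookmarks" / "last_date" layout inside the STATE value are all assumptions for illustration; only the outer {"type": "STATE", "value": ...} envelope comes from the Singer spec.

```python
import json
from datetime import date, datetime, timedelta


def safe_bookmark(rows, today=None):
    """Return the latest date that is safe to record as state, or None.

    Takes the max over *all* rows, so row order does not matter, and caps
    the result at yesterday so a still-changing "today" is never bookmarked.
    """
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    # Assumption: each row carries the raw ga:date value as "YYYYMMDD".
    seen = [datetime.strptime(row["ga:date"], "%Y%m%d").date() for row in rows]
    if not seen:
        return None
    return min(max(seen), yesterday)


def state_message(stream, bookmark):
    """Build a Singer STATE message; the layout of "value" is hypothetical."""
    return {
        "type": "STATE",
        "value": {"bookmarks": {stream: {"last_date": bookmark.isoformat()}}},
    }


def emit_state_after_stream(stream, rows, today=None):
    """Call only once every record for the stream has been written."""
    bookmark = safe_bookmark(rows, today=today)
    if bookmark is not None:
        print(json.dumps(state_message(stream, bookmark)))
    return bookmark
```

Because the bookmark is computed once per stream after all rows are in hand, unsorted API responses can't push the STATE ahead of what has actually been emitted.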
For the ad-hoc reports that don't have the date dimension / time dimension included, is there actually any guarantee they can even be processed incrementally? My belief would be that the answer is actually no, they can't. If it's some aggregate over the time window anchored in some other dimension, there's no knowing which rows might need to be updated since last time. For example, if the report is total visits by device form factor over the [start_date, end_date) range, there's no way to know which form factors may have changed or filter to just those as far as I can tell, because visits could have happened from any form factor since the last time the tap ran.
This makes me believe that for the ad-hoc reports lacking date/time dimensions, we actually shouldn't try to make them STATE-ful and instead just re-extract the whole report every time. If a user of the tap wants something to be incremental, then they need to specify a date dimension to make it incremental over, and then they can always aggregate back to the over-all-time structure they might need later. This would be implemented by emitting a placeholder STATE for the streams lacking this information indicating that there's no actual state to be preserved and that the next run shouldn't filter.
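As a sketch of how the placeholder could work (again, hypothetical names throughout: bookmark_for, next_start_date, and the "full_refresh" / "last_date" keys are made up for illustration, not anything the tap or Singer defines):

```python
from datetime import date, timedelta


def bookmark_for(dimensions, max_date_emitted):
    """Pick the per-stream state entry based on the report's dimensions.

    Reports that include ga:date get a real date bookmark; everything else
    gets a placeholder saying there is no state to resume from.
    """
    if "ga:date" in dimensions and max_date_emitted is not None:
        return {"last_date": max_date_emitted.isoformat()}
    return {"full_refresh": True}  # hypothetical placeholder marker


def next_start_date(bookmark, configured_start):
    """Decide where the next run should start extracting from."""
    if bookmark and "last_date" in bookmark:
        # Resume the day after the last fully extracted date.
        return date.fromisoformat(bookmark["last_date"]) + timedelta(days=1)
    # Placeholder or missing state: don't filter, re-extract the whole report.
    return configured_start
```

On the next run, a stream with the placeholder simply falls back to the configured start_date, which is the "re-extract everything" behaviour described above.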
Let me know what you think of these tradeoffs and this solution to the date dimensionless problem and I can prepare an MR!