Replication Key Signpost Behavior not consistent with expected behavior
Summary
My understanding was that the get_replication_key_signpost method is used to make sure that the stored replication key value never exceeds a certain value. For example, I overrode the method to return 2021-08-04T19:08:35+00:00 (as a datetime object), but the replication key value stored in my state is 2021-08-04T19:17:29Z, which is later than the signpost value. Am I misunderstanding what this method is supposed to do?
Steps to reproduce
- Ingest a stream with any timestamp field that will be used for replication (can generate a stream with fake data)
- In your stream class, override the get_replication_key_signpost method to return a datetime that is earlier than the highest date in your fake data
- Make sure the stream is unsorted
- Run ELT with a job_id, so state gets saved
- Check the replication key value in state: it will not be the get_replication_key_signpost value, but the highest value in the stream's fake data
What is the current bug behavior?
The timestamp of a stream stored in the state can be higher than what the get_replication_key_signpost method returns.
What is the expected correct behavior?
The timestamp of a stream stored in state should never be higher than what get_replication_key_signpost returns.
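To make the expectation concrete, here is a minimal sketch of the invariant I assumed the signpost enforces. It is plain Python and does not use the SDK; the helper name clamp_to_signpost is hypothetical and is not something the SDK provides, as far as I know. The two timestamps are the ones from the summary.

```python
from datetime import datetime, timezone
from typing import Optional


def clamp_to_signpost(value: datetime, signpost: Optional[datetime]) -> datetime:
    """Return the bookmark to store, never later than the signpost.

    Hypothetical helper: it only illustrates the behavior I expected.
    """
    if signpost is None:
        # No signpost configured, so the value is stored unchanged.
        return value
    return min(value, signpost)


# Values from the summary: the signpost I returned and the value found in state.
signpost = datetime(2021, 8, 4, 19, 8, 35, tzinfo=timezone.utc)
seen = datetime(2021, 8, 4, 19, 17, 29, tzinfo=timezone.utc)

stored = clamp_to_signpost(seen, signpost)
print(stored.isoformat())  # 2021-08-04T19:08:35+00:00
```

With this behavior the stored bookmark would be the signpost itself whenever a record's timestamp is later than it.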
Relevant logs and/or screenshots
As an example, I ingested a stream that output the following data:
{"id": "1112931637383", "updatedAt": "2021-08-04T16:38:04Z" }
{"id": "1112931637323", "updatedAt": "2021-08-04T03:58:39Z"}
{"id": "111293163735", "updatedAt": "2021-08-04T11:25:30Z"}
{"id": "1112931637", "updatedAt": "2021-08-04T16:37:54Z"}
{"id": "1112931637343", "updatedAt": "2021-08-04T10:42:30Z"}
{"id": "1112931637393", "updatedAt": "2021-08-04T09:44:58Z"}
{"id": "111293163738211", "updatedAt": "2021-08-04T08:36:52Z"}
{"id": "11129316373812", "updatedAt": "2021-08-04T07:55:35Z"}
{"id": "11129316373817", "updatedAt": "2021-08-04T07:24:55Z"}
{"id": "11129316373892", "updatedAt": "2021-08-04T09:53:29Z"}
State File
{
  "bookmarks": {
    "fake_stream": {
      "replication_key": "updatedAt",
      "replication_key_value": "2021-08-04T16:38:04Z"
    }
  }
}
However, the value of replication_key_signpost was 2021-08-02T00:00:00+00:00, which is earlier than the replication_key_value saved in state.
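The mismatch can be checked with a short standalone script. This is only a sketch in plain Python using the updatedAt values and the signpost listed above; the helper parse_ts is hypothetical and none of this is SDK code.

```python
from datetime import datetime


def parse_ts(text: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


# updatedAt values from the unsorted stream output, in emitted order.
updated_at = [
    "2021-08-04T16:38:04Z",
    "2021-08-04T03:58:39Z",
    "2021-08-04T11:25:30Z",
    "2021-08-04T16:37:54Z",
    "2021-08-04T10:42:30Z",
    "2021-08-04T09:44:58Z",
    "2021-08-04T08:36:52Z",
    "2021-08-04T07:55:35Z",
    "2021-08-04T07:24:55Z",
    "2021-08-04T09:53:29Z",
]

signpost = parse_ts("2021-08-02T00:00:00+00:00")
state_value = parse_ts("2021-08-04T16:38:04Z")  # replication_key_value in state

highest = max(parse_ts(ts) for ts in updated_at)

# The saved bookmark equals the highest timestamp in the data ...
print(state_value == highest)  # True
# ... and it is later than the signpost, which is the behavior reported here.
print(state_value > signpost)  # True
```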
Possible fixes
It looks like this method gets the signpost and passes it to this method, but it is never used