Replication Key Signpost Behavior not consistent with expected behavior

Summary

My understanding was that the get_replication_key_signpost method ensures the replication key value stored in state never exceeds a certain value. For example, I overrode the method to return 2021-08-04T19:08:35+00:00 (as a datetime object), but the replication key value stored in my state is 2021-08-04T19:17:29Z, which is later than the signpost value. Am I misunderstanding what that method is supposed to do?

Steps to reproduce

  • Ingest a stream with any timestamp field that will be used for replication (can generate a stream with fake data)
  • In your stream class, override the get_replication_key_signpost method to return a datetime that is earlier than the highest date in your fake data
  • Make sure the stream is unsorted
  • Run ELT with a job_id, so state gets saved
  • Check the replication_key_value in state: it will not be the get_replication_key_signpost value, but the highest value in the fake stream data
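For anyone trying to reproduce this, the fake data and the override could look roughly like the sketch below. It is deliberately standalone: `FakeStream` is a hypothetical stand-in rather than a real SDK stream class, and only the method name get_replication_key_signpost comes from the SDK. The `context` parameter, the `SIGNPOST` constant, and the record shape are assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta, timezone

# Fixed signpost, chosen to be earlier than the newest fake record below.
SIGNPOST = datetime(2021, 8, 2, tzinfo=timezone.utc)


class FakeStream:
    """Hypothetical stand-in for a stream class; not the SDK's base class."""

    replication_key = "updatedAt"

    def get_replication_key_signpost(self, context=None):
        # Override: always return the fixed signpost.
        return SIGNPOST

    def get_records(self, context=None, count=10, seed=0):
        # Build timestamps 12 hours apart, then shuffle so the stream is unsorted.
        rng = random.Random(seed)
        start = datetime(2021, 8, 1, tzinfo=timezone.utc)
        stamps = [start + timedelta(hours=12 * i) for i in range(count)]
        rng.shuffle(stamps)
        for i, ts in enumerate(stamps):
            yield {"id": str(i), "updatedAt": ts.strftime("%Y-%m-%dT%H:%M:%SZ")}


stream = FakeStream()
records = list(stream.get_records())
# Newest timestamp in the fake data (same-format ISO strings compare correctly).
newest = max(r["updatedAt"] for r in records)
```

In a real tap the same override would go on the actual stream class, and `newest` is the value that ends up in state when the bug occurs.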

What is the current bug behavior?

The timestamp stored in state for a stream can be later than what the get_replication_key_signpost method returns.

What is the expected correct behavior?

The timestamp stored in state for a stream should never be later than what get_replication_key_signpost returns.

Relevant logs and/or screenshots

As an example, I ingested a stream that output the following data:

{"id": "1112931637383", "updatedAt": "2021-08-04T16:38:04Z" }
{"id": "1112931637323", "updatedAt": "2021-08-04T03:58:39Z"}
{"id": "111293163735", "updatedAt": "2021-08-04T11:25:30Z"}
{"id": "1112931637", "updatedAt": "2021-08-04T16:37:54Z"}
{"id": "1112931637343", "updatedAt": "2021-08-04T10:42:30Z"}
{"id": "1112931637393", "updatedAt": "2021-08-04T09:44:58Z"}
{"id": "111293163738211", "updatedAt": "2021-08-04T08:36:52Z"}
{"id": "11129316373812", "updatedAt": "2021-08-04T07:55:35Z"}
{"id": "11129316373817", "updatedAt": "2021-08-04T07:24:55Z"}
{"id": "11129316373892", "updatedAt": "2021-08-04T09:53:29Z"}

State File

{
  "bookmarks": {
    "fake_stream": {
      "replication_key": "updatedAt",
      "replication_key_value": "2021-08-04T16:38:04Z"
    }
  }
}

However, the value returned by get_replication_key_signpost was 2021-08-02T00:00:00+00:00, which is earlier than the replication_key_value saved in state.
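To make the expectation concrete, here is a rough sketch of the capping I assumed would happen, using a few of the records above. This is not the SDK's implementation: `parse_ts` and `expected_bookmark` are hypothetical helpers written only to illustrate the expected behavior.

```python
from datetime import datetime, timezone


def parse_ts(value):
    # Parse the "Z"-suffixed UTC timestamps used in the records above.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def expected_bookmark(records, replication_key, signpost):
    # Newest value seen in the (unsorted) stream, capped at the signpost.
    newest = max(parse_ts(r[replication_key]) for r in records)
    return min(newest, signpost)


records = [
    {"id": "1112931637383", "updatedAt": "2021-08-04T16:38:04Z"},
    {"id": "1112931637323", "updatedAt": "2021-08-04T03:58:39Z"},
    {"id": "1112931637", "updatedAt": "2021-08-04T16:37:54Z"},
]
signpost = datetime(2021, 8, 2, tzinfo=timezone.utc)

bookmark = expected_bookmark(records, "updatedAt", signpost)
print(bookmark.isoformat())  # 2021-08-02T00:00:00+00:00
```

With this logic the bookmark would be the signpost, not 2021-08-04T16:38:04Z as in the state file above.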

Possible fixes

Looks like this method gets the signpost and passes it to this method, but it is never used.