Make GitLab ELT Incremental
The GitLab ELT is currently implemented as follows:
- Download the CSV files from a GCS bucket
- Decompress the files
- Integrate the CSV files using the corresponding strategy (upsert or overwrite)
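To make the two integration strategies concrete, here is a minimal sketch. It assumes rows are keyed by an `id` column and models the target table as an in-memory dict; the function name `integrate` and the row shape are hypothetical, not the ELT's actual code.

```python
def integrate(target, rows, strategy):
    """Apply a batch of rows to a target table keyed by id (illustrative only)."""
    batch = {row["id"]: row for row in rows}
    if strategy == "overwrite":
        # Assumed semantics: the target is replaced by the batch.
        return batch
    if strategy == "upsert":
        # Assumed semantics: existing ids are updated, new ids are inserted.
        merged = dict(target)
        merged.update(batch)
        return merged
    raise ValueError(f"unknown strategy: {strategy}")


target = {1: {"id": 1, "v": "old"}, 2: {"id": 2, "v": "keep"}}
rows = [{"id": 1, "v": "new"}, {"id": 3, "v": "add"}]

upserted = integrate(target, rows, "upsert")
overwritten = integrate(target, rows, "overwrite")
```

Under this reading, an export that contains only new rows fits the upsert strategy, because overwrite would drop every row that is missing from the batch.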
The bulk of this work is in the Pseudonymizer component of GitLab: we need to make it export only the data that is new since the last export. One way to do this would be to persist some kind of cursor (MAX(id) is a natural choice for numeric ids; MAX(created_date) can also work) and, instead of walking through the whole data set, start the extraction from this cursor.
We already output metadata files in the pseudonymizer run, so we could either add the cursor to that metadata or create a cursor.yml file that tracks it.
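The cursor-based extraction described above can be sketched as follows. This is only an illustration: it uses an in-memory SQLite database as a stand-in for the GitLab database, and the `issues` table, its columns, and the helper `extract_since` are assumptions made for the example.

```python
import sqlite3


def extract_since(conn, table, cursor_id):
    """Return the rows with id > cursor_id, plus the updated cursor."""
    # The table name is interpolated, so it is assumed to come from trusted config.
    rows = conn.execute(
        f"SELECT id, title FROM {table} WHERE id > ? ORDER BY id",
        (cursor_id,),
    ).fetchall()
    # Rows are ordered by id, so the last one carries the new MAX(id).
    new_cursor = rows[-1][0] if rows else cursor_id
    return rows, new_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO issues VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# First run: the default cursor (0) walks the whole table.
first_rows, cursor = extract_since(conn, "issues", 0)

# Second run: only the row added since the last export is extracted.
conn.execute("INSERT INTO issues VALUES (4, 'd')")
second_rows, cursor = extract_since(conn, "issues", cursor)
```

Note that a cursor on MAX(id) only picks up new rows; changes to rows that were already exported would not be seen by this sketch.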
The pseudonymizer would then:
- Read the provided cursor file (either provided at invocation, fetched from the latest run, or the default cursors)
- Extract starting at the cursor
- Export the updated cursor
- Upload the cursors along with the data
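The steps above could look roughly like this. It is a sketch under several assumptions: cursors are a mapping of entity name to the last exported id, the source data is a dict of row lists, and the names `DEFAULT_CURSORS`, `resolve_cursors`, and `run_export` are hypothetical.

```python
DEFAULT_CURSORS = {"issues": 0, "users": 0}  # hypothetical entity names


def resolve_cursors(provided=None, latest_run=None):
    """Pick cursors in priority order: provided at invocation, latest run, defaults."""
    base = provided if provided is not None else latest_run
    # Entities missing from the chosen cursor set fall back to their default.
    return {**DEFAULT_CURSORS, **(base or {})}


def run_export(source, cursors):
    """Extract the rows past each cursor and return (data, updated cursors)."""
    data, updated = {}, dict(cursors)
    for entity, rows in source.items():
        start = cursors.get(entity, 0)
        new_rows = [row for row in rows if row["id"] > start]
        data[entity] = new_rows
        if new_rows:
            updated[entity] = max(row["id"] for row in new_rows)
    return data, updated


source = {"issues": [{"id": 1}, {"id": 2}, {"id": 3}], "users": [{"id": 7}]}

cursors = resolve_cursors(latest_run={"issues": 2})
data, updated = run_export(source, cursors)

# The updated cursors travel with the data so the next run can start from them.
upload = {"data": data, "cursors": updated}
```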
There should be a way to invalidate this cursor in any of these cases (this might be a follow-up MR; until then, the cursor file can be deleted manually):
- An entity has changed:
  - New entity
  - New attribute
  - Changed transformation
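One possible way to detect these cases automatically, offered only as a sketch and not as the planned design, is to store a fingerprint of each entity definition next to its cursor and reset the cursor when the fingerprint no longer matches. The definition shape (`attributes`, `pseudo`) and the helper names below are assumptions made for illustration.

```python
import hashlib
import json


def entity_fingerprint(definition):
    """Stable hash of an entity's attributes and transformations."""
    canonical = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def effective_cursor(entity, definition, stored):
    """Return the stored cursor, or 0 when the cursor must be invalidated."""
    entry = stored.get(entity)
    if entry is None:
        return 0  # new entity: nothing has been exported yet
    if entry["fingerprint"] != entity_fingerprint(definition):
        return 0  # new attribute or changed transformation
    return entry["cursor"]


old_def = {"attributes": ["id", "title"], "pseudo": ["title"]}
new_def = {"attributes": ["id", "title", "state"], "pseudo": ["title"]}
stored = {"issues": {"cursor": 42, "fingerprint": entity_fingerprint(old_def)}}

unchanged = effective_cursor("issues", old_def, stored)
changed = effective_cursor("issues", new_def, stored)
brand_new = effective_cursor("users", {"attributes": ["id"], "pseudo": []}, stored)
```

With this approach, deleting the cursor file by hand remains a valid fallback, since a missing entry is treated the same as a new entity.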