[Observes] Differential GitLab ETL
Problem to solve
The GitLab ETL job that retrieves jobs and MR information from projects consumes a large amount of time.
The main cause is that, on every run, the data on the target is erased and created again from scratch.
Proposal
assumptions
- the API returns an ordered list of objects
- the list is in descending order
- object ids are assigned in ascending order, so newer objects have greater ids

facts
- if new objects are created on the source while pages are being retrieved, the page boundaries shift (an offset)
- consequently, when page x is retrieved and then page x+1, page x+1 may contain newer information than x and/or data duplicated from x, even though x+1 is expected to contain only older, non-duplicated data (a toy illustration follows this list)
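
To make the offset concrete, here is a toy simulation in plain Python (3 objects per page, no real API calls): one object is created between fetching page 1 and page 2, and page 2 re-serves an object that page 1 already returned.

```python
# ids on the source, newest first (descending)
ids = [9, 8, 7, 6, 5, 4, 3, 2, 1]

def page(objs, n, per_page=3):
    """Simulate a paginated endpoint over a descending list."""
    return objs[(n - 1) * per_page : n * per_page]

page1 = page(ids, 1)            # [9, 8, 7]
ids = [10] + ids                # a new object arrives on the source
page2 = page(ids, 2)            # [7, 6, 5] -- 7 is duplicated

# violates the expected invariant first_obj_on(x+1).id < last_obj_on(x).id
assert page2[0] == page1[-1]
```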
algorithm
1. search for the page `y` that contains the smallest element whose id is greater than `last_greatest_uploaded_id`
2. start from page `x = y - max_pages`, so a single run covers a bounded window of pages (see the sketch after this list)
3. retrieve page `x`, then retrieve page `x+1`, verifying that the first object satisfies `first_obj_on(x+1).id < last_obj_on(x).id`
4. if it does not, iterate over the objects/pages `[i, j]` until an object satisfies `obj_at(page=j, obj=i).id < last_obj_on(x).id`
5. meanwhile, execute the `newer_duplicated_found` strategy (default: do nothing)
6. when the condition is satisfied, redo step 3 until `last_greatest_uploaded_id` is reached
7. update `last_greatest_uploaded_id` with `greatest_id_uploaded`
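
A minimal sketch of one pass, under the assumptions above. `differential_sync`, `fetch_page`, and `upload` are hypothetical names introduced here for illustration, not the actual Observes code: `fetch_page(n)` stands for whatever client call returns page `n` as a list of objects with integer ids, newest first, and `upload(obj)` stands for the write to the target.

```python
from typing import Callable, Dict, List

Obj = Dict[str, int]


def differential_sync(
    fetch_page: Callable[[int], List[Obj]],
    upload: Callable[[Obj], None],
    last_greatest_uploaded_id: int,
    max_pages: int = 10,
    newer_duplicated_found: Callable[[Obj], None] = lambda obj: None,
) -> int:
    """Run one differential pass and return the new watermark."""
    # step 1: walk forward to page y, the first page whose smallest id
    # is at or below the watermark (pages are newest-first)
    y = 1
    while True:
        objs = fetch_page(y)
        if not objs or objs[-1]["id"] <= last_greatest_uploaded_id:
            break
        y += 1
    # step 2: a single run only covers the window [y - max_pages, y]
    start = max(y - max_pages, 1)
    greatest_id_uploaded = last_greatest_uploaded_id
    last_seen_id = None  # id of the last object accepted so far
    for x in range(start, y + 1):
        for obj in fetch_page(x):
            # steps 3-5: an object that is not strictly older than the
            # previous one is an offset artifact (newer or duplicated);
            # hand it to the strategy and skip it
            if last_seen_id is not None and obj["id"] >= last_seen_id:
                newer_duplicated_found(obj)
                continue
            # step 6: stop once the watermark is reached
            if obj["id"] <= last_greatest_uploaded_id:
                return greatest_id_uploaded
            upload(obj)
            greatest_id_uploaded = max(greatest_id_uploaded, obj["id"])
            last_seen_id = obj["id"]
    # step 7: the caller persists the returned watermark
    return greatest_id_uploaded
```

One run uploads at most `max_pages` pages and then advances the watermark, so the data lands in portions across successive runs instead of in one monolithic job.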
What does success look like, and how can we measure that?
Success is measured as time performance. However, because the current job has never completed successfully, the immediate success criterion is that portions of the data reach the target without corrupting it, i.e. with no duplicates and no omissions (a sketch of such a check follows).
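
A hedged sketch of the integrity check that criterion implies, assuming both the target and the source can be enumerated as lists of ids; `is_consistent` is a hypothetical name, not an existing function.

```python
from typing import Iterable


def is_consistent(uploaded_ids: Iterable[int], source_ids: Iterable[int]) -> bool:
    """True when the target has no duplicates and no omissions up to
    the highest id it claims to have uploaded."""
    uploaded = sorted(uploaded_ids)
    if len(uploaded) != len(set(uploaded)):
        return False  # duplicates corrupt the target
    if not uploaded:
        return True  # an empty target is trivially consistent
    watermark = uploaded[-1]
    expected = sorted(i for i in source_ids if i <= watermark)
    return uploaded == expected  # omissions show up as a mismatch
```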