[Observes] Differential GitLab ETL
Problem to solve
The GitLab ETL job that retrieves jobs and MR information from projects consumes a large amount of time.
The main cause is that, on every run, the data on the target is erased and created again from scratch.
Proposal
assumptions
- the API returns an ordered list of objects
- the list is in descending order
- object ids are assigned in ascending order, so newer objects have greater ids

facts
- if new objects are created on the source while pages are being retrieved, the page boundaries shift (an offset)
- consequently, when page x is retrieved and then page x+1, page x+1 may contain newer information than x and/or data duplicated from x, even though x+1 is expected to contain only older, non-duplicated data (a toy illustration follows this list)
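
To make the offset concrete, here is a toy simulation in plain Python (3 objects per page, no real API calls): one object is created between fetching page 1 and page 2, and page 2 re-serves an object that page 1 already returned.

```python
# ids on the source, newest first (descending)
ids = [9, 8, 7, 6, 5, 4, 3, 2, 1]

def page(objs, n, per_page=3):
    """Simulate a paginated endpoint over a descending list."""
    return objs[(n - 1) * per_page : n * per_page]

page1 = page(ids, 1)            # [9, 8, 7]
ids = [10] + ids                # a new object arrives on the source
page2 = page(ids, 2)            # [7, 6, 5] -- 7 is duplicated

# violates the expected invariant first_obj_on(x+1).id < last_obj_on(x).id
assert page2[0] == page1[-1]
```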
algorithm
1. search for the page `y` that contains the smallest element whose id is greater than `last_greatest_uploaded_id`
2. start from page `x = y - max_pages`, so a single run covers a bounded window of pages (see the sketch after this list)
3. retrieve page `x`, then retrieve page `x+1`, verifying that the first object satisfies `first_obj_on(x+1).id < last_obj_on(x).id`
4. if it does not, iterate over the objects/pages `[i, j]` until an object satisfies `obj_at(page=j, obj=i).id < last_obj_on(x).id`
5. meanwhile, execute the `newer_duplicated_found` strategy (default: do nothing)
6. when the condition is satisfied, redo step 3 until `last_greatest_uploaded_id` is reached
7. update `last_greatest_uploaded_id` with `greatest_id_uploaded`
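
A minimal sketch of one pass, under the assumptions above. `differential_sync`, `fetch_page`, and `upload` are hypothetical names introduced here for illustration, not the actual Observes code: `fetch_page(n)` stands for whatever client call returns page `n` as a list of objects with integer ids, newest first, and `upload(obj)` stands for the write to the target.

```python
from typing import Callable, Dict, List

Obj = Dict[str, int]


def differential_sync(
    fetch_page: Callable[[int], List[Obj]],
    upload: Callable[[Obj], None],
    last_greatest_uploaded_id: int,
    max_pages: int = 10,
    newer_duplicated_found: Callable[[Obj], None] = lambda obj: None,
) -> int:
    """Run one differential pass and return the new watermark."""
    # step 1: walk forward to page y, the first page whose smallest id
    # is at or below the watermark (pages are newest-first)
    y = 1
    while True:
        objs = fetch_page(y)
        if not objs or objs[-1]["id"] <= last_greatest_uploaded_id:
            break
        y += 1
    # step 2: a single run only covers the window [y - max_pages, y]
    start = max(y - max_pages, 1)
    greatest_id_uploaded = last_greatest_uploaded_id
    last_seen_id = None  # id of the last object accepted so far
    for x in range(start, y + 1):
        for obj in fetch_page(x):
            # steps 3-5: an object that is not strictly older than the
            # previous one is an offset artifact (newer or duplicated);
            # hand it to the strategy and skip it
            if last_seen_id is not None and obj["id"] >= last_seen_id:
                newer_duplicated_found(obj)
                continue
            # step 6: stop once the watermark is reached
            if obj["id"] <= last_greatest_uploaded_id:
                return greatest_id_uploaded
            upload(obj)
            greatest_id_uploaded = max(greatest_id_uploaded, obj["id"])
            last_seen_id = obj["id"]
    # step 7: the caller persists the returned watermark
    return greatest_id_uploaded
```

One run uploads at most `max_pages` pages and then advances the watermark, so the data lands in portions across successive runs instead of in one monolithic job.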
What does success look like, and how can we measure that?
Success is measured as time performance. However, because the current job has never completed successfully, the immediate success criterion is that portions of the data reach the target without corrupting it, i.e. with no duplicates and no omissions (a sketch of such a check follows).
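
A hedged sketch of the integrity check that criterion implies, assuming both the target and the source can be enumerated as lists of ids; `is_consistent` is a hypothetical name, not an existing function.

```python
from typing import Iterable


def is_consistent(uploaded_ids: Iterable[int], source_ids: Iterable[int]) -> bool:
    """True when the target has no duplicates and no omissions up to
    the highest id it claims to have uploaded."""
    uploaded = sorted(uploaded_ids)
    if len(uploaded) != len(set(uploaded)):
        return False  # duplicates corrupt the target
    if not uploaded:
        return True  # an empty target is trivially consistent
    watermark = uploaded[-1]
    expected = sorted(i for i in source_ids if i <= watermark)
    return uploaded == expected  # omissions show up as a mismatch
```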