
WIP: Group Export relations & attributes

George Koltsov requested to merge georgekoltsov/14239-group-import-export into master

Background information

A lot of existing customers are looking for Group Import/Export functionality. The problem statement can be found here: #14239 (comment 217193938)

With githost.io shutting down, this has become a pressing issue, and this project is now a top priority for the Import team (currently a team of one). Customers with big groups would like to be able to export groups with all relations (epics, labels, variables, etc.) as well as activity, packages and projects preserved.

A full list of Group I/E MVC 'must haves' can be found here: #14239 (comment 219075229)

My initial thoughts on Group I/E can be found here: #14239 (comment 219427700). The biggest one is that a Group is a much bigger entity than a Project, with potentially hundreds of projects (there are customers with such setups: #14239 (comment 210488104)), so it made sense to start thinking about distributing work across multiple workers.

Main reasons are:

  • Reduce the risk of memory consumption issues
  • Reduce the time it takes to perform I/E by allowing it to be performed in parallel
  • Not occupy one sidekiq worker for prolonged periods of time while Group I/E is running (however it might still be a risk with project exports within group export, as Project I/E was going to remain the same, with slight tweaks)

Given that our current Project I/E is run within one sidekiq job, we do not have an existing mechanism for distributing / tracking work across multiple workers. That means we need to create one.

Requirements (that I came up with myself):

  1. Split I/E of various entities across multiple sidekiq jobs with an ability to track state
  2. Each I/E entity has to be as small as possible (e.g. 1 package at a time, 1 relation at a time, etc) to reduce memory consumption
  3. Perform I/E in batches, to not occupy too many workers at a time, to prevent resource starvation
  4. Utilise existing Project I/E functionality, with a few tweaks to allow it to be part of Group I/E
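
Requirements 2 and 3 can be illustrated with a minimal sketch in plain Ruby. It is not code from this MR: `BATCH_SIZE`, `build_batches` and `process_in_batches` are hypothetical names, the batch size is an arbitrary assumption, and parts are processed inline here rather than being enqueued as sidekiq jobs.

```ruby
# Hypothetical sketch, not taken from the MR.
# BATCH_SIZE is an assumed value; the real limit would depend on worker capacity.
BATCH_SIZE = 2

# Split the list of small export parts into fixed-size batches so that only
# a limited number of parts is in flight at any one time.
def build_batches(parts, batch_size: BATCH_SIZE)
  parts.each_slice(batch_size).to_a
end

# Process one batch after another. In a real setup each part would be its own
# sidekiq job and the next batch would start from a "batch complete" callback;
# here the block is simply called inline for every part.
def process_in_batches(parts, batch_size: BATCH_SIZE)
  build_batches(parts, batch_size: batch_size).map do |batch|
    batch.map { |part| yield part }
  end
end

parts = %w[labels epics packages projects variables]
results = process_in_batches(parts) { |part| "exported #{part}" }
```

Keeping each part small bounds memory use per job, while the batch size bounds how many workers the export can occupy at once.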

Key concepts

  1. Group Export has many group parts (naming is hard, I called it steps first but then renamed to parts, can change if needed)
  2. Each part is a small(ish) unit to be exported/imported (e.g. top level relation of a group, a package, a project (as it currently stands))
  3. Each part's attributes contain:
     • status (created/started/finished/etc)
     • name of a part, describing what it is exporting (e.g. relation, package, project)
     • params containing what needs to be exported (e.g. relation hash {include: { labels: { include: ... } } })
     • timestamps
     • jid
  4. Both group and group part have a status (implemented using a state machine) to keep track of each individual part's progress and of the Group export process overall
  5. New group/import_export.yml that is similar to the project one, but introduces a new section that specifies not only the group tree of relations but also a list of things to be included in the export / import. Example:
include:
  - :relations
  - :attributes
  - :packages
  - :projects

This is arguably not needed, as the project's yml file does not specify things like lfs, avatars, wiki, etc. in its file.
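
The part concept described above can be sketched roughly as follows. This is an illustration in plain Ruby under stated assumptions, not the implementation in this MR: the class name `ExportPart`, the `failed` status and the transition table are hypothetical, and a hand-rolled transition check stands in for whatever state machine library the MR actually uses. Timestamps are omitted for brevity.

```ruby
# Hypothetical sketch of a single export part; names are assumptions.
class ExportPart
  # Allowed status transitions. `created`, `started` and `finished` come from
  # the description; `failed` is an assumed extra status.
  TRANSITIONS = {
    created: [:started],
    started: [:finished, :failed],
    finished: [],
    failed: []
  }.freeze

  attr_reader :name, :params, :status, :jid, :last_error

  def initialize(name:, params: {})
    @name = name        # what the part exports, e.g. "relation"
    @params = params    # e.g. the relation hash to export
    @status = :created
  end

  # Mark the part as picked up by a job and remember the job id.
  def start!(jid)
    transition_to(:started)
    @jid = jid
  end

  def finish!
    transition_to(:finished)
  end

  # Record the error message when a job fails.
  def fail!(error)
    transition_to(:failed)
    @last_error = error
  end

  private

  # Reject transitions that are not listed in TRANSITIONS.
  def transition_to(new_status)
    unless TRANSITIONS.fetch(@status).include?(new_status)
      raise ArgumentError, "cannot go from #{@status} to #{new_status}"
    end

    @status = new_status
  end
end

part = ExportPart.new(name: "relation", params: { include: { labels: {} } })
part.start!("job-1")
part.finish!
```

A group export record could then derive its own overall status from the statuses of its parts, for example treating the export as finished once every part is finished.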

High level activity (not fully implemented in this MR)

image

Sequence diagram

I made a sequence diagram since I find it most useful. However, due to its size I split it into multiple parts: Export Creation, Export Parts Scheduling, Export Parts Processing and Export Parts Batch Complete. Let me know if you would like to see it in a different format.

Export Creation (start of the flow)

image

Export Parts Scheduling

image

Export Parts Processing (Relations example)

image

Export Parts Batch Complete Callback

image

So far, the current MR (which is still WIP) introduces:

  • Export of Group relations
  • Across multiple sidekiq jobs
  • In batches
  • Writes group relations in separate files
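
As a rough illustration of the last point, writing each relation to its own file could look like the sketch below. The helper name `write_relation`, the `<relation>.json` naming scheme and the JSON format are assumptions made for this example, not details confirmed by the MR.

```ruby
require "json"
require "tmpdir"
require "fileutils"

# Hypothetical helper: serialise one relation's records into its own file,
# so that each sidekiq job only holds a single relation in memory.
def write_relation(export_dir, relation_name, records)
  FileUtils.mkdir_p(export_dir)
  path = File.join(export_dir, "#{relation_name}.json")
  File.write(path, JSON.generate(records))
  path
end

# Usage: write two relations into a temporary export directory.
Dir.mktmpdir do |dir|
  write_relation(dir, "labels", [{ "title" => "bug" }])
  write_relation(dir, "epics", [{ "title" => "Q1 roadmap" }])
end
```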

It still does not cover:

  • Attribute cleaning (all ids are present)
  • Failure recovery (whenever a job fails, only the last_error attribute is updated)
  • and much more that I haven't yet thought of ...

Demo: https://drive.google.com/file/d/1furQL2BnzUx2_MAUUmex_jqIcStCkqxm/view?usp=sharing

Closes #32931 (closed)

Edited by 🤖 GitLab Bot 🤖
