
Bulk import: split references pipeline into smaller workers

Madelein van Niekerk requested to merge 429609-split-referencing-pipeline into master

What does this MR do and why?

Closes #429609 (closed).

Splits BulkImports::Projects::Pipelines::ReferencesPipeline from a single worker into smaller workers behind a feature flag (see [Feature flag] Rollout of `bulk_import_async_re... (#430181 - closed)). The pipeline updates references in issue and MR descriptions and notes so that they map correctly to the new project.

Before

A single worker fetched all objects (issues, MRs and notes), built the references and saved the objects. In one case, it took ~3.5 hours to complete.

After

The single-worker approach was renamed to LegacyReferencesPipeline so that it remains in use when the feature flag is not enabled.
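As a rough illustration of that fallback, the sketch below picks a pipeline name from the flag state. `ToyFlags` and `references_pipeline_for` are hypothetical names invented for this example; they are not GitLab's `Feature` class or the code in this MR, and the real selection logic may look different.

```ruby
require 'set'

# Hypothetical stand-in for a feature flag store; not GitLab's Feature class.
class ToyFlags
  def initialize
    @enabled = Set.new
  end

  def enable(name)
    @enabled.add(name)
  end

  def disable(name)
    @enabled.delete(name)
  end

  def enabled?(name)
    @enabled.include?(name)
  end
end

# Returns the name of the pipeline that would run for the current flag state.
# Assumption: the choice is a simple either/or on a single flag.
def references_pipeline_for(flags)
  if flags.enabled?(:bulk_import_async_references_pipeline)
    'ReferencesPipeline'
  else
    'LegacyReferencesPipeline'
  end
end
```

With the flag disabled (the default in this toy store), the legacy name is returned; enabling the flag switches to the new pipeline.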

ReferencesPipeline is now responsible for fetching all objects and enqueuing a worker for each one so that its refs can be updated asynchronously. The new workers do not block the import: when a worker fails, the import as a whole does not fail, and the failure is added to the import's failures.
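The fan-out and the non-blocking failure handling can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the MR's code: `FanOutPipeline`, `Job` and the in-memory array standing in for the Sidekiq queue are all assumptions made for the example.

```ruby
# Hypothetical sketch; an in-memory array stands in for the job queue.
Job = Struct.new(:record_type, :record_id)

class FanOutPipeline
  attr_reader :queue, :failures

  def initialize
    @queue = []
    @failures = []
  end

  # Enqueue one job per object instead of transforming everything inline.
  def run(objects)
    objects.each { |type, id| @queue << Job.new(type, id) }
  end

  # Process every queued job. A failing job is recorded in `failures`
  # and the remaining jobs still run, so one failure does not abort the rest.
  def drain
    until @queue.empty?
      job = @queue.shift
      begin
        yield job
      rescue StandardError => e
        @failures << { record_type: job.record_type, record_id: job.record_id, message: e.message }
      end
    end
  end
end

pipeline = FanOutPipeline.new
pipeline.run([[:issue, 1], [:merge_request, 2], [:note, 3]])

processed = []
pipeline.drain do |job|
  # Simulate one object whose references cannot be rewritten.
  raise ArgumentError, 'cannot rewrite reference' if job.record_id == 2

  processed << job.record_id
end
```

In this toy run the job for record 2 raises, yet records 1 and 3 are still processed and the error ends up in `pipeline.failures`.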

| | Before the pipeline | After the pipeline |
| --- | --- | --- |
| MR description | Screenshot 2023-11-09 at 08.43.26.png | Screenshot 2023-11-09 at 08.47.43.png |
| MR note | Screenshot 2023-11-09 at 08.43.32.png | Screenshot 2023-11-09 at 08.47.48.png |
| Issue description | Screenshot 2023-11-09 at 08.43.59.png | Screenshot 2023-11-09 at 08.47.54.png |
| Issue note | Screenshot 2023-11-09 at 08.44.07.png | Screenshot 2023-11-09 at 08.48.01.png |

Database queries

The database queries remained the same except for the following improvements:

  1. Instead of loading the whole issue, MR or note record, we now select only the id.
  2. Instead of stepping through issues and MRs twice (once for themselves and once for their notes), we loop through them once and load up notes within the same loop.
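The single-pass idea in the second point can be sketched as below. The `Noteable` struct and `each_reference_target` are hypothetical names used only for illustration, and in-memory data stands in for the database; only ids are carried around, in line with the first point.

```ruby
# Hypothetical in-memory data; only ids are kept, never full records.
Noteable = Struct.new(:id, :note_ids)

# One pass: emit the noteable's own id, then the ids of its notes,
# before moving on to the next noteable.
def each_reference_target(noteables, type)
  return enum_for(:each_reference_target, noteables, type) unless block_given?

  noteables.each do |noteable|
    yield [type, noteable.id]
    noteable.note_ids.each { |note_id| yield [:note, note_id] }
  end
end

issues = [Noteable.new(10, [100, 101]), Noteable.new(11, [])]
targets = each_reference_target(issues, :issue).to_a
```

Each issue is visited once, and its notes are emitted inside the same iteration rather than in a second pass over all issues.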

The resulting database queries are as follows for the gitlab project:

  • Loading up issues in batches:
    • 6.853 ms for SELECT "issues"."iid" FROM "issues" WHERE "issues"."project_id" = 278964 ORDER BY "issues"."iid" ASC LIMIT 1
    • 21.165 ms for SELECT "issues"."iid" FROM "issues" WHERE "issues"."project_id" = 278964 AND "issues"."iid" >= 1 ORDER BY "issues"."iid" ASC LIMIT 1 OFFSET 100
    • 191.533 ms for SELECT "issues"."id" FROM "issues" WHERE "issues"."project_id" = 278964 AND "issues"."iid" >= 1 AND "issues"."iid" < 101
  • Issue notes:
    • 49.734 ms for SELECT "notes"."id" FROM "notes" WHERE "notes"."noteable_id" = 278965 AND "notes"."noteable_type" = 'Issue' ORDER BY "notes"."id" ASC LIMIT 1
    • 14.630 ms for SELECT "notes"."id" FROM "notes" WHERE "notes"."noteable_id" = 278965 AND "notes"."noteable_type" = 'Issue' AND "notes"."id" >= 1 ORDER BY "notes"."id" ASC LIMIT 1 OFFSET 100
    • 12.014 ms for SELECT "notes"."id" FROM "notes" WHERE "notes"."noteable_id" = 278965 AND "notes"."noteable_type" = 'Issue' AND "notes"."id" >= 1 AND "notes"."id" < 101
  • Merge requests in batches (similar queries):
    • 23.701 ms
    • 62.736 ms
    • 684.578 ms
  • MR notes:
    • 68.414 ms
    • 14.846 ms
    • 15.246 ms
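The three queries per group follow a familiar batching shape: find the first key, find the key that sits one batch further along (the `OFFSET 100` query), then select the ids between those two bounds. The sketch below mimics that shape over a sorted in-memory array. `batch_bounds` and `keys_in` are hypothetical helpers written for this illustration and are not the batching code the pipeline uses.

```ruby
# Returns half-open [start, stop) bounds; stop is nil for the final batch.
def batch_bounds(sorted_keys, batch_size)
  bounds = []
  start = sorted_keys.first # like: ORDER BY key ASC LIMIT 1
  while start
    # like: WHERE key >= start ORDER BY key ASC LIMIT 1 OFFSET batch_size
    stop = sorted_keys.select { |key| key >= start }[batch_size]
    bounds << [start, stop]
    start = stop
  end
  bounds
end

# like: WHERE key >= start AND key < stop (no upper bound on the last batch)
def keys_in(sorted_keys, start, stop)
  sorted_keys.select { |key| key >= start && (stop.nil? || key < stop) }
end

iids = (1..250).to_a
bounds = batch_bounds(iids, 100)
first_batch = keys_in(iids, *bounds.first)
```

For 250 consecutive keys and a batch size of 100, this yields three batches, the last one without an upper bound.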

Because these queries have already been database reviewed and are performant enough, I don't think we need additional review on this.

How to set up and validate locally

  1. Disable the feature flag: Feature.disable(:bulk_import_async_references_pipeline).
  2. Import a group via the Direct Importer. Add refs to issues, MRs and notes on the projects being imported.
  3. Tail the importer logs to see a single worker for LegacyReferencesPipeline.
  4. Ensure that the refs are converted to links pointing to the new project.
  5. Enable the feature flag: Feature.enable(:bulk_import_async_references_pipeline).
  6. Import the group again.
  7. Tail the logs or view the Sidekiq UI to see that a single worker called ReferencesPipeline is enqueued, followed by a TransformReferencesWorker for each issue, MR and note.
  8. Ensure that the refs are converted to links pointing to the new project.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #429609 (closed)

Edited by Madelein van Niekerk
