
WIP: Use de-duplication to reduce memory and number of SQL queries for import

What does this MR do?

This uses de-duplication to:

  1. reduce the amount of memory needed to hold the hash, as it contains a lot of duplication,

  2. re-use already created relations instead of creating new ones.
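
To illustrate the first point, here is a minimal sketch of structural de-duplication over a parsed hash. The `Deduplicator` class and its `dedup` method are hypothetical names used for illustration only, not the code in this MR. The sketch assumes that equal sub-structures can be safely replaced by one shared, frozen instance.

```ruby
# Hypothetical helper: returns one shared, frozen instance for every
# structurally equal Hash, Array, or String it is given.
class Deduplicator
  def initialize
    # Maps a value to the first equal instance that was seen.
    @seen = {}
  end

  def dedup(value)
    case value
    when Hash
      interned = value.each_with_object({}) { |(k, v), h| h[dedup(k)] = dedup(v) }
      @seen[interned] ||= interned.freeze
    when Array
      interned = value.map { |v| dedup(v) }
      @seen[interned] ||= interned.freeze
    when String
      @seen[value] ||= value.dup.freeze
    else
      # Integers, nil, booleans, and symbols are left untouched.
      value
    end
  end
end

dedup = Deduplicator.new
first  = dedup.dedup({ "title" => "bug", "color" => "#ff0000" })
second = dedup.dedup({ "title" => "bug", "color" => "#ff0000" })
first.equal?(second) # both names refer to the same frozen object
```

Freezing the shared instances is a design choice in this sketch: it turns accidental mutation of a shared object into an immediate `FrozenError` instead of a silent change visible elsewhere.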

This is based on: !18005 (merged) !18003 (merged) !18007 (merged) !18024 (merged)

Problems

We need to be careful about when de-duplication is used, as it can introduce hard-to-debug problems.

Let's consider the following example:

  "merge_requests": [
    {
      "id": 27,
      "target_branch": "feature",
      "source_branch": "feature_conflict",
      "source_project_id": 999,
      "author_id": 1,
      "merge_params": {
        "force_remove_source_branch": null
      },
      ...
      "resource_label_events": [
        {
          "id":243,
          "action":"add",
          "issue_id":null,
          "merge_request_id":27,
          "label_id":null,
          "user_id":1,
          "created_at":"2018-08-28T08:24:00.494Z"
        }
      ],

There are problems with:

  1. merge_params, which might point to the same hash,
  2. resource_label_events (not here exactly, as there's a unique id).

The merge_params case needs to be handled automatically, so de-duplication needs to understand whether the hierarchy it defines is linked to the top level.
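
The hazard with merge_params can be sketched as follows. This is an illustrative assumption about how sharing could go wrong, not a reproduction of an observed bug; the `aliasing_demo` method and the second merge request (id 28) are made up for the example.

```ruby
# Hypothetical demonstration: two merge requests end up pointing at the
# same merge_params hash after naive de-duplication.
def aliasing_demo
  shared_params = { "force_remove_source_branch" => nil }

  mr_a = { "id" => 27, "merge_params" => shared_params }
  mr_b = { "id" => 28, "merge_params" => shared_params } # id 28 is made up

  # Changing the params of one merge request...
  mr_a["merge_params"]["force_remove_source_branch"] = "1"

  # ...silently changes the other one as well, because both keys
  # reference the same Hash object.
  [mr_a, mr_b]
end

mr_a, mr_b = aliasing_demo
mr_b["merge_params"]["force_remove_source_branch"] # => "1"
```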

Ideally, this means that we should de-duplicate only top-level objects, with the understanding that objects on lower levels can be re-used only if a matching entry is found on the top level.

This means we should consider de-duplicating only relations such as:

  • labels => label,
  • milestones => milestone,
  • likely others as well

This reduces efficiency, but should also reduce the chance of things going sideways.
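
One possible reading of this rule, sketched under assumptions: objects built from an allow-listed top-level collection are registered, and a nested relation re-uses an object only when a matching registered entry exists. `TopLevelDedup`, `register`, and `resolve` are hypothetical names, and the allow-list contains only the two relations named above.

```ruby
# Hypothetical sketch of top-level-only de-duplication.
class TopLevelDedup
  # Top-level collection => nested relation allowed to re-use its objects.
  TOP_LEVEL_TO_NESTED = { "labels" => "label", "milestones" => "milestone" }.freeze

  def initialize
    @registry = Hash.new { |hash, key| hash[key] = {} }
  end

  # Records an object built from a top-level collection. Collections that
  # are not on the allow-list are ignored.
  def register(collection, attributes, object)
    nested = TOP_LEVEL_TO_NESTED[collection]
    # A frozen shallow copy keeps the lookup key stable.
    @registry[nested][attributes.dup.freeze] = object if nested
    object
  end

  # Returns the registered object for a nested relation when one matches,
  # otherwise builds a fresh object with the given block.
  def resolve(relation, attributes)
    found = @registry.key?(relation) ? @registry[relation][attributes] : nil
    found || yield(attributes)
  end
end

dedup = TopLevelDedup.new
bug = dedup.register("labels", { "title" => "bug" }, Object.new)
same = dedup.resolve("label", { "title" => "bug" }) { Object.new }
same.equal?(bug) # the nested label re-uses the top-level object
```

Relations such as merge_params never enter the registry in this sketch, so every `resolve` call for them builds a new object.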

Does this MR meet the acceptance criteria?

Conformity
