Skip to content

Update Import process to persist object subrelations in separate db transactions

What does this MR do and why?

This MR updates the way Project Import persists objects in the database in order to reduce db transaction times as mentioned in #343458 (closed)

Each 'top level relation' object has nested subrelations within it. For example: a project issue with 5000 notes in it. When this issue is saved - a new transaction is opened and all of the notes are inserted before transaction is committed. This can take some time to execute.

Current approach

Whenever relation_object.save! is called -- a new transaction is opened and ActiveRecord persists all subrelations as part of main top level relation. Example of 1 MR with 5 notes. Note 1 transaction in total.

relation_object.save!

irb(#<Gitlab::ImportExport::Project::RelationTreeRestorer:0x00007fd4ae7780c8>):005:0> relation_object.save!
D, [2022-02-15T16:46:35.736020 #58940] DEBUG -- :   TRANSACTION (0.2ms)  BEGIN 

D, [2022-02-15T16:46:35.792512 #58940] DEBUG -- :   MergeRequest Create (1.1ms)  INSERT INTO "merge_requests" ("target_branch", "source_branch", "source_project_id", "author_id", "title", "created_at", "updated_at", "merge_status", "target_project_id", "iid", "description", "merge_params", "title_html", "description_html", "cached_markdown_version", "lock_version", "time_estimate") VALUES ('main', 'reported_user_18-main-patch-97838', 349, 40, 'Update README.md', '2022-02-03 20:15:14.447000', '2022-02-03 20:20:38.214000', 'can_be_merged', 349, 1, '123', '---
force_remove_source_branch: ''1''
', 'Update README.md', '<p data-sourcepos="1:1-1:3" dir="auto">123</p>', 1900544, 0, 0) RETURNING "id" /*application:console,correlation_id:c5c2fb83f9886326d9f7703b49ca3ab0,db_config_name:main,line:/lib/gitlab/import_export/group/relation_tree_restorer.rb:5:in `save_relation_object'*/
D, [2022-02-15T16:46:35.819502 #58940] DEBUG -- :   Note Create (0.7ms)  INSERT INTO "notes" ("note", "noteable_type", "author_id", "created_at", "updated_at", "project_id", "noteable_id", "discussion_id", "note_html", "cached_markdown_version", "last_edited_at", "system") VALUES ('a126a29dcc5a41c534a54c9129e658b6

 *By Administrator on 2022-02-03T20:16:52 (imported from GitLab)*', 'MergeRequest', 1, '2022-02-03 20:16:52.977000', '2022-02-03 20:16:52.977000', 349, 7583, '2e4d53b3181f9a9f6d2d170d85223dcd88002b6e', '<p data-sourcepos="1:1-1:32" dir="auto">a126a29dcc5a41c534a54c9129e658b6</p>
<p data-sourcepos="3:2-3:65" dir="auto"><em>By Administrator on 2022-02-03T20:16:52 (imported from GitLab)</em></p>', 1900544, '2022-02-03 20:16:52.977000', FALSE) RETURNING "id" /*application:console,correlation_id:c5c2fb83f9886326d9f7703b49ca3ab0,db_config_name:main,line:/lib/gitlab/import_export/group/relation_tree_restorer.rb:5:in `save_relation_object'*/

...

D, [2022-02-15T16:46:35.922181 #58940] DEBUG -- :   TRANSACTION (0.3ms)  COMMIT /*application:console,correlation_id:c5c2fb83f9886326d9f7703b49ca3ab0,db_config_name:main,line:/lib/gitlab/database.rb:298:in `commit'*/

=> true

Proposed approach

Separate all collection subrelations from main 'top level relation'. Save top level relation first and then save all collection subrelations in batches of 100. This should reduce open db transaction times. Example of 1 MR with 5 notes and batch size of 1. Note 1 transaction for main relation object & one transaction for each batch of subrelations.

A nice side effect of this approach is an ability to still import top level relation even if one of the subrelations is invalid. For example: if an award emoji of a merge request is invalid - the current approach will fail to import the entire merge request, while proposed approach will fail to import award emoji, while will still import the MR.

A not nice side effect: if there is an invalid record in the batch - the entire batch will likely to fail to be imported. But there might be a way to still process other records.

relation_object.save!

D, [2022-02-15T16:39:25.983691 #58940] DEBUG -- :   TRANSACTION (0.3ms)  BEGIN /*application:console,correlation_id:04194364b939a7318222df418a1c6bbc,db_config_name:main,line:/lib/gitlab/import_export/group/relation_tree_restorer.rb:4:in `save_relation_object'*/

...

D, [2022-02-15T16:39:26.397535 #58940] DEBUG -- :   TRANSACTION (0.4ms)  COMMIT /*application:console,correlation_id:04194364b939a7318222df418a1c6bbc,db_config_name:main,line:/lib/gitlab/database.rb:298:in `commit'*/

=> true

save_subrelations # example of saving 1 note at a time

D, [2022-02-15T16:41:31.453959 #58940] DEBUG -- :   TRANSACTION (0.2ms)  BEGIN 
D, [2022-02-15T16:41:31.565703 #58940] DEBUG -- :   Note Create (1.5ms)  INSERT INTO "notes" ("note", "noteable_type", "author_id", "created_at", "updated_at", "project_id", "noteable_id", "discussion_id", "note_html", "cached_markdown_version", "last_edited_at", "system") VALUES ('a126a29dcc5a41c534a54c9129e658b6

 *By Administrator on 2022-02-03T20:16:52 (imported from GitLab)*', 'MergeRequest', 1, '2022-02-03 20:16:52.977000', '2022-02-03 20:16:52.977000', 348, 7582, '2e4d53b3181f9a9f6d2d170d85223dcd88002b6e', '<p data-sourcepos="1:1-1:32" dir="auto">a126a29dcc5a41c534a54c9129e658b6</p>
<p data-sourcepos="3:2-3:65" dir="auto"><em>By Administrator on 2022-02-03T20:16:52 (imported from GitLab)</em></p>', 1900544, '2022-02-03 20:16:52.977000', FALSE) RETURNING "id" /*application:console,correlation_id:04194364b939a7318222df418a1c6bbc,db_config_name:main,line:/lib/gitlab/database.rb:265:in `block in transaction'*/
D, [2022-02-15T16:41:31.567035 #58940] DEBUG -- :   TRANSACTION (0.2ms)  COMMIT /*application:console,correlation_id:04194364b939a7318222df418a1c6bbc,db_config_name:main,line:/lib/gitlab/database.rb:298:in `commit'*/
D, [2022-02-15T16:41:31.569550 #58940] DEBUG -- :   License Load (0.2ms)  SELECT "licenses".* FROM "licenses" ORDER BY "licenses"."id" DESC LIMIT 100 /*application:console,correlation_id:04194364b939a7318222df418a1c6bbc,db_config_name:main,line:/ee/app/models/license.rb:338:in `load_license'*/
D, [2022-02-15T16:41:31.590747 #58940] DEBUG -- :   TRANSACTION (0.1ms)  BEGIN /*application:console,correlation_id:04194364b939a7318222df418a1c6bbc,db_config_name:main,line:/lib/gitlab/database.rb:265:in `block in transaction'*/
D, [2022-02-15T16:41:31.591560 #58940] DEBUG -- :   Note Create (0.6ms)  INSERT INTO "notes" ("note", "noteable_type", "author_id", "created_at", "updated_at", "project_id", "noteable_id", "discussion_id", "note_html", "cached_markdown_version", "last_edited_at", "system") VALUES ('f0b7d7b2f204909a3b8bdf92a40db6d9

 *By Administrator on 2022-02-03T20:16:53 (imported from GitLab)*', 'MergeRequest', 1, '2022-02-03 20:16:53.028000', '2022-02-03 20:16:53.028000', 348, 7582, 'bf5ee1c5a1a659c97a68932c4bd623326af8df92', '<p data-sourcepos="1:1-1:32" dir="auto">f0b7d7b2f204909a3b8bdf92a40db6d9</p>
<p data-sourcepos="3:2-3:65" dir="auto"><em>By Administrator on 2022-02-03T20:16:53 (imported from GitLab)</em></p>', 1900544, '2022-02-03 20:16:53.028000', FALSE) RETURNING "id" /*application:console,correlation_id:04194364b939a7318222df418a1c6bbc,db_config_name:main,line:/lib/gitlab/database.rb:265:in `block in transaction'*/
D, [2022-02-15T16:41:31.592640 #58940] DEBUG -- :   TRANSACTION (0.2ms)  COMMIT /*application:console,correlation_id:04194364b939a7318222df418a1c6bbc,db_config_name:main,line:/lib/gitlab/database.rb:298:in `commit'*/

...

Lastly, this is a proof of concept and we can work on this iteratively. There is still potential for long db transactions if there are a lot of records within another sub-relation of a sub-relation. For example: MR that has a comment with 5000 award emojis. In this case there can still be a situation of a long running transaction. However, even doing this approach should help with lowering transaction times down. We can go even further and process all of the subrelations with this approach recursively.

Screenshots or screen recordings

These are strongly recommended to assist reviewers and reduce the time to merge your change.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by George Koltsov

Merge request reports