
Use #in_batches in Project Export StreamingSerializer

What does this MR do?

This MR attempts to fix an issue with Project Export statement timeouts (such as #212355 (closed)). Currently, large project exports can fail with ActiveRecord::QueryCanceled: PG::QueryCanceled: ERROR: canceling statement due to statement timeout.

The reason (described in #212355 (comment 364049215)) is as follows.

records.find_each(batch_size: batch_size) (which resolves to project.issues.find_each(batch_size: 20)) in https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/import_export/json/streaming_serializer.rb#L72 runs the following query, which times out. I was able to reproduce this on both staging and production.

SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = <PROJECT_ID> ORDER BY "issues"."id" ASC LIMIT 20

Our legacy implementation, on the other hand, uses records.in_batches(of: batch_size) and does not time out:

SELECT "issues"."id" FROM "issues" WHERE "issues"."project_id" = <PROJECT_ID> ORDER BY "issues"."id" ASC LIMIT 20

.find_each uses .in_batches under the hood, but with the load: true flag, which is why its batch queries select full rows instead of only ids.
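To illustrate the difference (a minimal sketch of the Rails behavior, assuming project is any Project record; the puts bodies are placeholders):

# find_each delegates to find_in_batches, which calls in_batches(load: true);
# each batch is fetched as full rows in a single query:
#   SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = $1
#   ORDER BY "issues"."id" ASC LIMIT 20
project.issues.find_each(batch_size: 20) do |issue|
  puts issue.iid
end

# in_batches defaults to load: false: it paginates on an id-only query and
# yields an unloaded relation, which later loads full rows by primary key:
#   SELECT "issues"."id" FROM "issues" WHERE "issues"."project_id" = $1
#   ORDER BY "issues"."id" ASC LIMIT 20
#   SELECT "issues".* FROM "issues" WHERE "issues"."id" IN (...)
project.issues.in_batches(of: 20) do |batch|
  batch.each { |issue| puts issue.iid }
end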

A few potential solutions to this:

  1. For each project relation, add a composite (project_id, id) index, since Active Record's batch ordering can't be changed as per the documentation (see https://api.rubyonrails.org/classes/ActiveRecord/Batches.html)
  2. Change .find_each back to .in_batches. This requires a slight change to batch handling, since in_batches yields a relation per batch instead of individual records (see the sketch after this list)
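
For option 2, a minimal sketch of the batch-handling change (records and batch_size follow the linked streaming_serializer.rb; serialize_record is a hypothetical stand-in for the per-record serialization logic):

# Before: find_each yields records one at a time, but each underlying batch
# query selects full rows ordered by id, which is what times out.
records.find_each(batch_size: batch_size) do |record|
  serialize_record(record)
end

# After: in_batches yields an ActiveRecord::Relation per batch, so the block
# iterates the batch itself; pagination happens on the cheap id-only query.
records.in_batches(of: batch_size) do |batch|
  batch.each { |record| serialize_record(record) }
end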

I tried a few variations in a fresh console session on staging. These use the merge_requests relation; the statement timeouts happen on the issues relation in production, but it looks like more than one relation is affected.

SELECT merge_requests.id first:

[ gstg ] production> p.merge_requests.in_batches(of: 20) { |b| p b; break }

D, [2020-06-19T12:38:24.033353 #18536] DEBUG -- :    (17.5ms)  SELECT "merge_requests"."id" FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4481077 ORDER BY "merge_requests"."id" ASC LIMIT 20 /*application:console*/

D, [2020-06-19T12:38:24.037550 #18536] DEBUG -- :   MergeRequest Load (2.9ms)  SELECT "merge_requests".* FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4481077 AND "merge_requests"."id" IN (5667901, 5667902, 5667903, 5667904, 5667905, 5667906, 5667907, 5667908, 5667909, 5667910, 5667911, 5667912, 5667913, 5667914, 5667915, 5667916, 5667917, 5667918, 5667919, 5667920) LIMIT 11 /*application:console*/
#<ActiveRecord::AssociationRelation [...]>



[ gstg ] production> p.merge_requests.find_each(batch_size: 20) { |b| p b; break }

D, [2020-06-19T12:40:00.719287 #18536] DEBUG -- :   MergeRequest Load (10003.6ms)  SELECT "merge_requests".* FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4481077 ORDER BY "merge_requests"."id" ASC LIMIT 20 /*application:console*/

ActiveRecord::QueryCanceled (PG::QueryCanceled: ERROR:  canceling statement due to statement timeout)

SELECT merge_requests.* first:

[ gstg ] production> p.merge_requests.find_each(batch_size: 20) { |b| p b; break }
D, [2020-06-19T12:42:31.753269 #19218] DEBUG -- :   MergeRequest Load (10004.6ms)  SELECT "merge_requests".* FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4481077 ORDER BY "merge_requests"."id" ASC LIMIT 20 /*application:console*/

ActiveRecord::QueryCanceled (PG::QueryCanceled: ERROR:  canceling statement due to statement timeout)


[ gstg ] production> p.merge_requests.in_batches(of: 20) { |b| p b; break }
D, [2020-06-19T12:42:50.694315 #19218] DEBUG -- :    (13.3ms)  SELECT "merge_requests"."id" FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4481077 ORDER BY "merge_requests"."id" ASC LIMIT 20 /*application:console*/

D, [2020-06-19T12:42:50.700828 #19218] DEBUG -- :   MergeRequest Load (5.3ms)  SELECT "merge_requests".* FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4481077 AND "merge_requests"."id" IN (5667901, 5667902, 5667903, 5667904, 5667905, 5667906, 5667907, 5667908, 5667909, 5667910, 5667911, 5667912, 5667913, 5667914, 5667915, 5667916, 5667917, 5667918, 5667919, 5667920) LIMIT 11 /*application:console*/

#<ActiveRecord::AssociationRelation [...]>
=> nil

Query plans:

[ gstg ] production> pp ActiveRecord::Base.connection.execute('explain SELECT "merge_requests"."id" FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4481077 ORDER BY "merge_requests"."id" ASC LIMIT 20').to_a

D, [2020-06-19T12:57:22.094174 #19218] DEBUG -- :    (3.3ms)  explain SELECT "merge_requests"."id" FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4481077 ORDER BY "merge_requests"."id" ASC LIMIT 20 /*application:console*/
[{"QUERY PLAN"=>"Limit  (cost=198.44..198.49 rows=20 width=4)"},
 {"QUERY PLAN"=>"  ->  Sort  (cost=198.44..206.00 rows=3024 width=4)"},
 {"QUERY PLAN"=>"        Sort Key: id"},
 {"QUERY PLAN"=>
   "        ->  Index Only Scan using index_merge_requests_on_tp_id_and_merge_commit_sha_and_id on merge_requests  (cost=0.56..117.98 rows=3024 width=4)"},
 {"QUERY PLAN"=>"              Index Cond: (target_project_id = 4481077)"}]
[ gstg ] production> pp ActiveRecord::Base.connection.execute('explain SELECT "merge_requests".* FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4481077 ORDER BY "merge_requests"."id" ASC LIMIT 20').to_a
D, [2020-06-19T12:57:57.177757 #19218] DEBUG -- :    (3.5ms)  explain SELECT "merge_requests".* FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4481077 ORDER BY "merge_requests"."id" ASC LIMIT 20 /*application:console*/
[{"QUERY PLAN"=>"Limit  (cost=0.43..3159.09 rows=20 width=748)"},
 {"QUERY PLAN"=>
   "  ->  Index Scan using merge_requests_pkey on merge_requests  (cost=0.43..477590.22 rows=3024 width=748)"},
 {"QUERY PLAN"=>"        Filter: (target_project_id = 4481077)"}]

The -> Index Scan using merge_requests_pkey on merge_requests (cost=0.43..477590.22 rows=3024 width=748) step seems very costly. This is staging on the merge_requests relation, but it is a similar problem on production (I ran similar tests in the database-lab Slack channel) and saw similar figures.

I populated my own project's merge requests to the 3k mark, and here's the explain analyze output:

[ gstg ] production> pp ActiveRecord::Base.connection.execute('explain analyze SELECT "merge_requests".* FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4402162 ORDER BY "merge_requests"."id" ASC LIMIT 20').to_a
D, [2020-06-19T14:14:29.736300 #24979] DEBUG -- :    (3990.1ms)  explain analyze SELECT "merge_requests".* FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 4402162 ORDER BY "merge_requests"."id" ASC LIMIT 20 /*application:console*/
[{"QUERY PLAN"=>
   "Limit  (cost=0.43..2554.09 rows=20 width=747) (actual time=3986.907..3986.929 rows=20 loops=1)"},
 {"QUERY PLAN"=>
   "  ->  Index Scan using merge_requests_pkey on merge_requests  (cost=0.43..488004.30 rows=3822 width=747) (actual time=3986.906..3986.925 rows=20 loops=1)"},
 {"QUERY PLAN"=>"        Filter: (target_project_id = 4402162)"},
 {"QUERY PLAN"=>"        Rows Removed by Filter: 4754309"},
 {"QUERY PLAN"=>"Planning Time: 0.346 ms"},
 {"QUERY PLAN"=>"Execution Time: 3986.972 ms"}]

Could this be due to table size? Rows Removed by Filter: 4754309 suggests the planner walks merge_requests_pkey in id order and discards millions of non-matching rows before it finds 20 for this project, so the cost grows with the size of the table and with how deep into it the project's rows sit.
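
If we instead go with solution 1, here is a hedged sketch of what the migration could look like, using GitLab's add_concurrent_index helper (the class and index names are illustrative, and each affected relation would need its own index):

# frozen_string_literal: true

class AddIndexOnMergeRequestsTargetProjectIdAndId < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  DOWNTIME = false
  INDEX_NAME = 'index_merge_requests_on_target_project_id_and_id'

  disable_ddl_transaction!

  def up
    # A composite index lets the planner satisfy both the target_project_id
    # filter and the ORDER BY id from a single index scan, instead of walking
    # the primary key and filtering out millions of rows.
    add_concurrent_index :merge_requests, [:target_project_id, :id], name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :merge_requests, INDEX_NAME
  end
end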

Mentions #212355 (closed)

Screenshots

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
