Skip to content

Optimize CSV export to use find_each

Stan Hu requested to merge sh-export-csv-each-batch into master

As seen in #360012, exporting issues to CSV using find_each can hit a database timeout since an ORDER BY id is needed. Use each_batch instead.

This commit adds a feature flag, export_csv_preload_in_batches, that will preload associations in the batch load itself to reduce the work the database needs to load all issues.

How to test locally

  1. Create lots of issues for a project:

    project = Project.last
    user = User.last
    
    100000.times do |index|
      issue = Issue.new(author: user, description: 'test issue', title: "test issue #{index}", project: project)
      issue.save!
    end
  2. Attempt to export the CSV.

    service = Issues::ExportCsvService.new(project.issues, project); nil
    output = File.open("/tmp/test.csv", "w")
    output.write(service.csv_data); nil
    output.close
  3. Before the feature flag is enabled, you'll see this query:

      Issue Load (1.3ms)  SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 52 ORDER BY "issues"."id" ASC LIMIT 1000 /*application:console,db_config_name:main,console_hostname:jet-arm.local,console_username:stanhu,line:/lib/csv_builder.rb:85:in `each'*/
  4. Enable the feature flag:

    Feature.enable(:export_csv_preload_in_batches)
  5. After the feature flag is enabled, you'll see these queries:

    Issue Load (3.4ms)  SELECT "issues"."id" FROM "issues" WHERE "issues"."project_id" = 52 ORDER BY "issues"."id" ASC, "issues"."created_at" ASC LIMIT 1 /*application:console,db_config_name:main,console_hostname:jet-arm.local,console_username:stanhu,line:/app/models/concerns/each_batch.rb:62:in `each_batch'*/
    Issue Load (3.0ms)  SELECT "issues"."id" FROM "issues" WHERE "issues"."project_id" = 52 AND "issues"."id" >= 503 ORDER BY "issues"."id" ASC, "issues"."created_at" ASC LIMIT 1 OFFSET 1000 /*application:console,db_config_name:main,console_hostname:jet-arm.local,console_username:stanhu,line:/app/models/concerns/each_batch.rb:81:in `block in each_batch'*/
    Issue Load (1.4ms)  SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 52 AND "issues"."id" >= 503 AND "issues"."id" < 1503 /*application:console,db_config_name:main,console_hostname:jet-arm.local,console_username:stanhu,line:/lib/csv_builder.rb:82:in `block in each'*/

Database Lab comparison with gitlab-org/gitlab issues

Original query

SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 278964 ORDER BY "issues"."id" ASC LIMIT 1000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/14996/commands/52157 shows 8 minutes!

Time: 8.086 min
  - planning: 6.708 ms
  - execution: 8.086 min
    - I/O read: 20.263 min
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 224924 (~1.70 GiB) from the buffer pool
  - reads: 930960 (~7.10 GiB) from the OS file cache, including disk I/O
  - dirtied: 2724 (~21.30 MiB)
  - writes: 0

New each_batch queries:

Start query

https://gitlab.com/gitlab-org/gitlab/-/blob/f50075762cf33d3841b88bb191770776b07ede77/app/models/concerns/each_batch.rb#L62

SELECT "issues"."id" FROM "issues" WHERE "issues"."project_id" = 278964 ORDER BY "issues"."id" ASC, "issues"."created_at" ASC LIMIT 1

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/14996/commands/52155

Time: 2.129 s
  - planning: 0.444 ms
  - execution: 2.128 s
    - I/O read: 5.569 s
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 116824 (~912.70 MiB) from the buffer pool
  - reads: 7334 (~57.30 MiB) from the OS file cache, including disk I/O
  - dirtied: 1699 (~13.30 MiB)
  - writes: 0

Stop: https://gitlab.com/gitlab-org/gitlab/-/blob/f50075762cf33d3841b88bb191770776b07ede77/app/models/concerns/each_batch.rb#L81

SELECT "issues"."id" FROM "issues" WHERE "issues"."project_id" = 278964 AND "issues"."id" >= 382515 ORDER BY "issues"."id" ASC, "issues"."created_at" ASC LIMIT 1 OFFSET 1000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/14996/commands/52156

Time: 634.853 ms
  - planning: 0.704 ms
  - execution: 634.149 ms
    - I/O read: 1.393 s
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 116158 (~907.50 MiB) from the buffer pool
  - reads: 6176 (~48.30 MiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

Batched query

SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 278964 AND "issues"."id" >= 382515 AND "issues"."id" < 383515 ORDER BY "issues"."id" ASC

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/14996/commands/52158

Time: 34.173 ms
  - planning: 2.983 ms
  - execution: 31.190 ms
    - I/O read: 28.256 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 55 (~440.00 KiB) from the buffer pool
  - reads: 461 (~3.60 MiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0
Edited by Stan Hu

Merge request reports