-
Benchmarks
Ran benchmarks today. I don't remember it being this slow when I first ran it; perhaps that's because it was still so much faster than the previous approaches.
2.5m issues in 10m45s
```
pry(main)> Benchmark.bm{|b| b.report('bulk_insert:'){Issue.bulk_insert{|w| ([project_id: p.id, author_id: author.id, title: FFaker::Lorem.sentence] * 2500000).each{|i| w.add(i)}};nil} };nil
       user     system      total        real
bulk_insert: 312.370000 176.510000 488.880000 (645.096816)
```
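Worth noting: in that one-liner, `[attrs] * 2500000` builds the attributes hash once and repeats it, so every issue ends up with the same FFaker title. A variant that evaluates the faker per row (at some extra cost) would be:

```ruby
# Sketch: one FFaker title per issue instead of 2.5m copies of one hash.
# Slower, but gives distinct data; p and author are from the session above.
Issue.bulk_insert do |w|
  2_500_000.times do
    w.add(project_id: p.id, author_id: author.id, title: FFaker::Lorem.sentence)
  end
end
```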
1.5m projects in 41m21s
```
Benchmark.bm{|b| b.report('bulk_insert:'){Project.bulk_insert{|w| 1500000.times.map{|i| {name: n = FFaker::Product.product, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id}}.each{|p| w.add(p)}}}};nil
       user     system      total        real
bulk_insert: 930.670000 861.590000 1792.260000 (2481.286568)
```
-
1.5m projects in 30m22s with columns specified
This approach doesn't use the default values, so it might mean less data being sent to Postgres. This gem just generates INSERT statements in sets of 500 rows, so we could probably do better by increasing the set size.
```
[97] pry(main)> Benchmark.bm do |b|
[97] pry(main)*   b.report('bulk_insert:') do
[97] pry(main)*     Project.bulk_insert(:name, :path, :creator_id, :namespace_id){|w| 1500000.times.map{|i| {name: n = FFaker::Product.product, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id}}.each{|p| w.add(p)}}
[97] pry(main)*   end
[97] pry(main)* end
       user     system      total        real
bulk_insert: 502.490000 749.220000 1251.710000 (1822.020671)
```
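If the 500-row batches are the limiting factor, the bulk_insert gem also accepts a `set_size` option (500 is its documented default), so a larger batch is a one-line change. Untested sketch:

```ruby
# Untested sketch: the same columns-specified insert, with larger batches
# via the gem's set_size option (defaults to 500 rows per INSERT).
Project.bulk_insert(:name, :path, :creator_id, :namespace_id, set_size: 5_000) do |w|
  1_500_000.times do
    n = FFaker::Product.product
    w.add(name: n, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id)
  end
end
```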
1.5m projects in 4m23s using generate_series
Much better performance this way, and it could be extended to use default values, similar to https://github.com/jamis/bulk_insert/blob/master/lib/bulk_insert/worker.rb#L38-L44 (a sketch follows the benchmark below).
```
[136] pry(main)> Benchmark.bm do |b|
  b.report('raw_insert:') do
    query = <<-ENDSQL
      INSERT INTO projects (name, path, creator_id, namespace_id)
      SELECT md5(random()::text), md5(random()::text), #{author.id}, #{author.namespace_id}
      FROM generate_series(1, 1500000) s(i);
    ENDSQL
    puts ActiveRecord::Base.connection.execute(query).values.join("\n");nil
  end
end
       user     system      total        real
raw_insert:   0.000000   0.010000   0.010000 (263.568029)
```
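For the default-values extension mentioned above, something along these lines could splice each column's schema default into the SELECT list. This is a rough, untested sketch (type handling is simplified and the excluded-column list is hand-picked):

```ruby
# Rough sketch: append non-null schema defaults to the generate_series
# INSERT, similar in spirit to bulk_insert's worker. Untested.
conn = ActiveRecord::Base.connection
default_cols = Project.columns
  .reject { |c| %w[id name path creator_id namespace_id].include?(c.name) }
  .reject { |c| c.default.nil? }

query = <<-ENDSQL
  INSERT INTO projects (name, path, creator_id, namespace_id#{default_cols.map { |c| ", #{c.name}" }.join})
  SELECT md5(random()::text), md5(random()::text), #{author.id}, #{author.namespace_id}#{default_cols.map { |c| ", #{conn.quote(c.default)}" }.join}
  FROM generate_series(1, 1500000) s(i);
ENDSQL
ActiveRecord::Base.connection.execute(query)
```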
-
1.5m projects in 5m33s using generate_series and defaults
Using the active_record-pg_generate_series gem to include default values for each column. Without the default values the UI breaks, since values like `created_at` are used throughout.
```
[6] pry(main)> Benchmark.bm do |b|
  b.report('generate_series:') do
    Project.insert_using_generate_series(1, 1500000) do |sql|
      sql.name = raw("md5(random()::text)")
      sql.path = raw("md5(random()::text)")
      sql.creator_id = author.id
      sql.namespace_id = author.namespace_id
    end
  end
end
       user     system      total        real
generate_series:  0.000000   0.000000   0.000000 (332.865800)
```
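A quick way to confirm the defaults actually landed (hypothetical spot check; any defaulted column such as `created_at` works):

```ruby
# Spot-check a seeded row: created_at should be populated rather than NULL
# if the gem filled in the column defaults.
project = Project.last
raise 'defaults missing' if project.created_at.nil?
```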
1.5m projects in 2m7s by replacing md5 with a number sequence
This version saves time by naming projects `seed_project_1500000` instead of calculating multiple `md5` hashes. It still doesn't create associated tables like `project_features`, so the projects aren't very useful, but it is a good starting point for slowing things down.

```
[40] pry(main)> Benchmark.bm do |b|
  b.report('generate_series:') do
    Project.insert_using_generate_series(1, 1500000) do |sql|
      project_name = raw("'seed_project_' || seq")
      sql.name = project_name
      sql.path = project_name
      sql.creator_id = author.id
      sql.namespace_id = author.namespace_id
    end
  end
end
       user     system      total        real
generate_series:  0.000000   0.000000   0.000000 (127.377769)
```
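Assuming `seq` is the gem's alias for the generate_series counter, the string concatenation happens entirely in Postgres, so every row still gets a unique name with no per-row Ruby work. A hypothetical post-seed sanity check (assuming the series starts at 1):

```ruby
# Hypothetical sanity check after seeding with the sequence-based names.
Project.where("name LIKE 'seed_project_%'").count # => 1500000
Project.order(:id).last.name                      # => "seed_project_1500000"
```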
-
Rake task used for testing the Related Issues feature:
desc "Bulk insert" task bulk_insert: :environment do # require 'bulk_insert' or manually paste code from Gem first # After doing the following my queries ran at 1/10th of the speed of staging, instead of 1/5000th of the speed. # Disable database insertion logs so speed isn't limited by ability to print to console old_logger = ActiveRecord::Base.logger ActiveRecord::Base.logger = nil # Create lots of issues user = FactoryGirl.create(:user, username: "user-#{rand}", email: "foo-#{rand}@example.com") namespace = FactoryGirl.create(:namespace, owner: user, route: nil, name: "hi-#{rand}") p = FactoryGirl.create(:empty_project, creator: user, namespace: namespace) Issue.bulk_insert{|w|([project_id: p.id, author_id: user.id, title: FFaker::Lorem.sentence]* 2500000).each{|i| w.add(i)}};nil # Create lots of projects Project.bulk_insert{|w| 1500000.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize("_"), creator_id: user.id, namespace_id: user.namespace_id}}.each{|p| w.add(p)}} # Force a different/slower query plan by updating project visibility Project.where(visibility_level: Gitlab::VisibilityLevel::PRIVATE).limit(200000).update_all(visibility_level: Gitlab::VisibilityLevel::PUBLIC) Project.where(visibility_level: Gitlab::VisibilityLevel::PRIVATE).limit(20000).update_all(visibility_level: Gitlab::VisibilityLevel::INTERNAL) IssueLink.bulk_insert do |w| count = Issue.count first_issues = Issue.all.take(100_000) first_issues.each_with_index.each do |issue, i| relations = ((i+1)..(i+6)).to_a.map do |x| target_issue = first_issues[x] if target_issue { source_id: issue.id, target_id: target_issue.id } end end.compact relations.each { |i| w.add(i) } end end # Reset logging ActiveRecord::Base.logger = old_logger end
-
@oswaldo You might be interested in https://gitlab.com/gitlab-org/gitlab-ce/issues/28149 (I've started working on that: https://gitlab.com/gitlab-org/gitlab-ce/compare/master...28149-improve-seed)...
-
Great to know @rymai, thanks!
-
@mkozono: This info was very useful to me in working on https://gitlab.com/gitlab-org/gitlab-ee/issues/4645. Thanks @jamedjo @rymai @oswaldo!
-
@mkozono: This is pretty much what I did to benchmark Geo selective sync queries (it's not pretty, but there you go), by hacking together various pieces from the comments above:
```ruby
require 'bulk_insert'

# Disable database insertion logs so speed isn't limited by ability to print to console
old_logger = ActiveRecord::Base.logger
ActiveRecord::Base.logger = nil

def bigrand; rand(999999999999); end

# Create 500 root namespaces
num_root_namespaces = 500
num_root_namespaces.times { path = "group#{bigrand}"; Group.create!(name: path, path: path) }

# Create 1000 descendant namespaces (randomly distributed)
num_descendant_namespaces = 1000
last_namespace_id = Namespace.last.id
num_descendant_namespaces.times { path = "group#{bigrand}"; Group.create!(name: path, path: path, parent_id: last_namespace_id - num_root_namespaces + 1 + rand(num_root_namespaces)) }

# Create 5000000 projects (randomly distributed)
num_random_projects = 5000000
total_namespaces = num_root_namespaces + num_descendant_namespaces
last_namespace_id = Namespace.last.id
Project.bulk_insert{|w| num_random_projects.times.map{|i| {name: n = FFaker::Product.product, path: n.parameterize(separator: "_"), namespace_id: last_namespace_id - total_namespaces + 1 + rand(total_namespaces)}}.each{|p| w.add(p)}}

# Create 1000000 projects in the last namespace
num_last_namespace_projects = 1000000
last_namespace_id = Namespace.last.id
Project.bulk_insert{|w| num_last_namespace_projects.times.map{|i| {name: n = FFaker::Product.product, path: n.parameterize(separator: "_"), namespace_id: last_namespace_id}}.each{|p| w.add(p)}}

# On secondary
# Create registry for last 3000000 projects
num_new_projects = 3000000
now = Time.now
last_project_id = Project.last.id
insert_attrs = ((last_project_id - num_new_projects + 1)..(last_project_id)).map{|id| {created_at: now, project_id: id}}; puts('finished making insert_attrs');
Geo::ProjectRegistry.bulk_insert{|w| insert_attrs.each_with_index{|p, i| puts(i) if i % 10000 == 0; w.add(p)}}

f = Geo::ProjectRegistryFinder.new(current_node: Gitlab::Geo.current_node)
f.count_synced_repositories

ActiveRecord::Base.logger = old_logger

Benchmark.measure { f.count_synced_repositories }
Benchmark.measure { f.count_synced_wikis }
Benchmark.measure { f.count_failed_repositories }
Benchmark.measure { f.count_failed_wikis }
Benchmark.measure { f.count_verified_repositories }
Benchmark.measure { f.count_verified_wikis }
Benchmark.measure { f.count_verification_failed_repositories }
Benchmark.measure { f.count_verification_failed_wikis }
Benchmark.measure { f.find_registries_to_verify(batch_size: 1000); 1 }
Benchmark.measure { f.find_projects_updated_recently(batch_size: 1000); 1 }
Benchmark.measure { f.find_unsynced_projects(batch_size: 1000); 1 }
```
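Side note on reading those results: `Benchmark.measure` returns a `Benchmark::Tms`, and the wall-clock number is its `real` attribute:

```ruby
# The last value printed by Benchmark.measure is wall-clock seconds.
tms = Benchmark.measure { f.count_synced_repositories }
puts tms.real
```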
Edit: I didn't include how to add the gems, e.g. to an Omnibus install. @toon needed this:
```
/opt/gitlab/embedded/bin/gem install ffaker bulk_insert
/opt/gitlab/embedded/bin/gem which ffaker bulk_insert
# /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/ffaker-2.10.0/lib/ffaker.rb
# /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/bulk_insert-1.7.0/lib/bulk_insert.rb

$LOAD_PATH.push('/opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/ffaker-2.10.0/lib/')
$LOAD_PATH.push('/opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/bulk_insert-1.7.0/lib/')
```
And then in the rails console:
```ruby
require 'bulk_insert'
require 'ffaker'
```
-
@mkozono: Cross-linking another example: https://gitlab.com/snippets/1827527