  • Author Contributor

    Benchmarks

    Ran benchmarks today. I don't remember it being this slow when I first ran it; perhaps it only felt fast at the time because it was so much faster than previous approaches.

    2.5m issues in 10m45s

    pry(main)> Benchmark.bm{|b| b.report('bulk_insert:'){Issue.bulk_insert{|w|([project_id: p.id, author_id: author.id, title: FFaker::Lorem.sentence]* 2500000).each{|i| w.add(i)}};nil} };nil
           user     system      total        real
    bulk_insert:
    312.370000 176.510000 488.880000 (645.096816)

    1.5m projects in 41m21s

    Benchmark.bm{|b| b.report('bulk_insert:'){Project.bulk_insert{|w| 1500000.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id}}.each{|p| w.add(p)}}}};nil
           user     system      total        real
    bulk_insert:
    930.670000 861.590000 1792.260000 (2481.286568)
  • Author Contributor

    1.5m projects in 30m22s with columns specified

    This approach doesn't set the default values, so it might mean less data being sent to Postgres. The gem just generates INSERT statements in sets of 500 rows, so we could probably do better by increasing the number of rows per INSERT (see the sketch after the benchmark below).

    [97] pry(main)> Benchmark.bm do |b|
    [97] pry(main)*   b.report('bulk_insert:') do  
    [97] pry(main)*     Project.bulk_insert(:name, :path, :creator_id, :namespace_id){|w| 1500000.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id}}.each{|p| w.add(p)}}    
    [97] pry(main)*   end    
    [97] pry(main)* end  
           user     system      total        real
    bulk_insert:
    502.490000 749.220000 1251.710000 (1822.020671)
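
    If the 500-row batches are the limiting factor, one cheap thing to try before dropping to raw SQL would be bumping the gem's set_size option (assuming it behaves as documented; I haven't benchmarked this). A minimal sketch mirroring the benchmark above:

    Benchmark.bm do |b|
      b.report('bulk_insert (larger sets):') do
        # set_size controls how many rows go into each generated INSERT (default 500)
        Project.bulk_insert(:name, :path, :creator_id, :namespace_id, set_size: 5_000) do |w|
          1_500_000.times do
            n = FFaker::Product.product
            w.add(name: n, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id)
          end
        end
      end
    end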

    1.5m projects in 4m23s using generate_series

    Much better performance this way, and it could be extended to use default values, similar to https://github.com/jamis/bulk_insert/blob/master/lib/bulk_insert/worker.rb#L38-L44 (a rough sketch follows the benchmark below).

    [136] pry(main)> Benchmark.bm do |b|
      b.report('raw_insert:') do  
        query = <<-ENDSQL    
        INSERT INTO projects (name, path, creator_id, namespace_id)
        SELECT
          md5(random()::text),
          md5(random()::text),
          #{author.id},
          #{author.namespace_id}
        FROM generate_series(1, 1500000) s(i);
        ENDSQL
        puts ActiveRecord::Base.connection.execute(query).values.join("\n");nil    
      end    
    end  
           user     system      total        real
    raw_insert:
      0.000000   0.010000   0.010000 (263.568029)
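
    For reference, a rough sketch of what that extension could look like in the raw SQL, just hard-coding created_at/updated_at to NOW() rather than reading the real column defaults (so treat the column list as an assumption):

    query = <<-ENDSQL
    INSERT INTO projects (name, path, creator_id, namespace_id, created_at, updated_at)
    SELECT
      md5(random()::text),
      md5(random()::text),
      #{author.id},
      #{author.namespace_id},
      NOW(),
      NOW()
    FROM generate_series(1, 1500000) s(i);
    ENDSQL
    ActiveRecord::Base.connection.execute(query)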
  • Author Contributor

    1.5m projects in 5m33s using generate_series and defaults

    Using the active_record-pg_generate_series gem to include default values for each column. Without the default values the UI will be broken, as values like created_at are used throughout.

    [6] pry(main)> Benchmark.bm do |b|
      b.report('generate_series:') do
        Project.insert_using_generate_series(1, 1500000) do |sql|
          sql.name = raw("md5(random()::text)")
          sql.path = raw("md5(random()::text)")
          sql.creator_id = author.id
          sql.namespace_id = author.namespace_id
        end
      end
    end
           user     system      total        real
    generate_series:  0.000000   0.000000   0.000000 (332.865800)
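
    (If anyone wants to reproduce this: I'm assuming the gem can simply be pulled into the Gemfile under the name above — I haven't verified that, and on an Omnibus install the manual gem install shown further down may be needed instead.)

    # Gemfile — assumption: the gem is published under this exact name
    gem 'active_record-pg_generate_series', group: :development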

    1.5m projects in 2m7s by replacing md5 with a number sequence

    This version saves time by naming projects seed_project_1 through seed_project_1500000 instead of calculating an md5 hash per row. It still doesn't create rows in associated tables like project_features, so the projects aren't very useful, but it is a good starting point for slowing things down.

    [40] pry(main)> Benchmark.bm do |b|
      b.report('generate_series:') do
        Project.insert_using_generate_series(1, 1500000) do |sql|
          project_name = raw("'seed_project_' || seq")
          sql.name = project_name
          sql.path = project_name
          sql.creator_id = author.id
          sql.namespace_id = author.namespace_id
        end
      end
    end
           user     system      total        real
    generate_series:  0.000000   0.000000   0.000000 (127.377769)
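
    The associated rows could probably be backfilled afterwards with the same generate-in-the-database trick. A sketch only — the project_features column names and values here are assumptions that would need checking against the current schema:

    # 20 is assumed to be ProjectFeature::ENABLED; adjust to match the model
    query = <<-ENDSQL
    INSERT INTO project_features (project_id, issues_access_level, created_at, updated_at)
    SELECT p.id, 20, NOW(), NOW()
    FROM projects p
    WHERE p.name LIKE 'seed_project_%'
      AND NOT EXISTS (SELECT 1 FROM project_features pf WHERE pf.project_id = p.id);
    ENDSQL
    ActiveRecord::Base.connection.execute(query)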
  • Contributor

    Rake task used for testing Related Issues feature:

    desc "Bulk insert"
    task bulk_insert: :environment do
      # require 'bulk_insert' or manually paste code from Gem first
      # After doing the following my queries ran at 1/10th of the speed of staging, instead of 1/5000th of the speed.
    
      # Disable database insertion logs so speed isn't limited by ability to print to console
      old_logger = ActiveRecord::Base.logger
      ActiveRecord::Base.logger = nil
    
      # Create lots of issues
      user = FactoryGirl.create(:user, username: "user-#{rand}", email: "foo-#{rand}@example.com")
      namespace = FactoryGirl.create(:namespace, owner: user, route: nil, name: "hi-#{rand}")
      p = FactoryGirl.create(:empty_project, creator: user, namespace: namespace)
      Issue.bulk_insert{|w|([project_id: p.id, author_id: user.id, title: FFaker::Lorem.sentence]* 2500000).each{|i| w.add(i)}};nil
    
      # Create lots of projects
      Project.bulk_insert{|w| 1500000.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize("_"), creator_id: user.id, namespace_id: user.namespace_id}}.each{|p| w.add(p)}}
    
      # Force a different/slower query plan by updating project visibility
      Project.where(visibility_level: Gitlab::VisibilityLevel::PRIVATE).limit(200000).update_all(visibility_level: Gitlab::VisibilityLevel::PUBLIC)
      Project.where(visibility_level: Gitlab::VisibilityLevel::PRIVATE).limit(20000).update_all(visibility_level: Gitlab::VisibilityLevel::INTERNAL)
    
      IssueLink.bulk_insert do |w| 
        first_issues = Issue.all.take(100_000)
        first_issues.each_with_index do |issue, i|
          relations = ((i+1)..(i+6)).to_a.map do |x|
            target_issue = first_issues[x]
    
            if target_issue
              { source_id: issue.id, target_id: target_issue.id }
            end
          end.compact
    
          relations.each { |relation| w.add(relation) }
        end
      end
    
      # Reset logging
      ActiveRecord::Base.logger = old_logger
    end
  • Contributor

    Great to know @rymai, thanks!

  • Maintainer

    This info was very useful to me in working on https://gitlab.com/gitlab-org/gitlab-ee/issues/4645. Thanks @jamedjo @rymai @oswaldo!

  • Maintainer

    This is pretty much what I did to benchmark Geo selective sync queries (it's not pretty, but there you go) by hacking together various pieces from comments above:

    require 'bulk_insert'
    
    # Disable database insertion logs so speed isn't limited by ability to print to console
    old_logger = ActiveRecord::Base.logger
    ActiveRecord::Base.logger = nil
    
    def bigrand; rand(999999999999); end
    
    # Create 500 root namespaces
    num_root_namespaces = 500
    num_root_namespaces.times { path="group#{bigrand}";Group.create!(name: path, path: path) }
    
    # Create 1000 descendant namespaces (randomly distributed)
    num_descendant_namespaces = 1000
    last_namespace_id = Namespace.last.id
    num_descendant_namespaces.times { path="group#{bigrand}";Group.create!(name: path, path: path, parent_id: last_namespace_id-num_root_namespaces+1+rand(num_root_namespaces)) }
    
    # Create 5000000 projects (randomly distributed)
    num_random_projects = 5000000
    total_namespaces = num_root_namespaces + num_descendant_namespaces
    last_namespace_id = Namespace.last.id
    Project.bulk_insert{|w| num_random_projects.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize(separator: "_"), namespace_id: last_namespace_id-total_namespaces+1+rand(total_namespaces)}}.each{|p| w.add(p)}}
    
    # Create 1000000 projects in the last namespace
    num_last_namespace_projects = 1000000
    last_namespace_id = Namespace.last.id
    Project.bulk_insert{|w| num_last_namespace_projects.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize(separator: "_"), namespace_id: last_namespace_id}}.each{|p| w.add(p)}}
    
    # On secondary
    
    # Create registry for last 3000000 projects
    num_new_projects = 3000000
    now = Time.now
    last_project_id = Project.last.id
    insert_attrs = ((last_project_id-num_new_projects+1)..(last_project_id)).map{|id|{created_at:now, project_id: id}}; puts('finished making insert_attrs'); Geo::ProjectRegistry.bulk_insert{|w| insert_attrs.each_with_index{|p,i| puts(i) if i%10000==0; w.add(p)}}
    
    f = Geo::ProjectRegistryFinder.new(current_node: Gitlab::Geo.current_node)
    f.count_synced_repositories
    
    ActiveRecord::Base.logger = old_logger
    
    
    
    Benchmark.measure { f.count_synced_repositories }
    Benchmark.measure { f.count_synced_wikis }
    Benchmark.measure { f.count_failed_repositories }
    Benchmark.measure { f.count_failed_wikis }
    Benchmark.measure { f.count_verified_repositories }
    Benchmark.measure { f.count_verified_wikis }
    Benchmark.measure { f.count_verification_failed_repositories }
    Benchmark.measure { f.count_verification_failed_wikis }
    Benchmark.measure { f.find_registries_to_verify(batch_size: 1000);1 }
    Benchmark.measure { f.find_projects_updated_recently(batch_size: 1000);1 }
    Benchmark.measure { f.find_unsynced_projects(batch_size: 1000);1 }

    Edit: I didn't include how to add the gem, e.g. to an Omnibus install. @toon needed this:

    /opt/gitlab/embedded/bin/gem install ffaker bulk_insert
    /opt/gitlab/embedded/bin/gem which ffaker bulk_insert
    # /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/ffaker-2.10.0/lib/ffaker.rb
    # /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/bulk_insert-1.7.0/lib/bulk_insert.rb
    $LOAD_PATH.push('/opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/ffaker-2.10.0/lib/')
    $LOAD_PATH.push('/opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/bulk_insert-1.7.0/lib/')

    And then in the rails console:

    require 'bulk_insert'
    require 'ffaker'
  • Maintainer

    Cross linking another example: https://gitlab.com/snippets/1827527
