
Database seeds for making development almost as slow as production (for Milestoneish#issues_visible_to_user)

    Authored by James Edwards-Jones

    make_dev_database_slow.rb
    • Author Contributor

      Benchmarks

      Ran benchmarks today. I don't remember it being this slow when I first ran it, perhaps because it was so much faster than previous approaches.

      2.5m issues in 10m45s

      pry(main)> Benchmark.bm{|b| b.report('bulk_insert:'){Issue.bulk_insert{|w|([project_id: p.id, author_id: author.id, title: FFaker::Lorem.sentence]* 2500000).each{|i| w.add(i)}};nil} };nil
             user     system      total        real
      bulk_insert: 312.370000 176.510000 488.880000 (645.096816)

      1.5m projects in 41m21s

      Benchmark.bm{|b| b.report('bulk_insert:'){Project.bulk_insert{|w| 1500000.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id}}.each{|p| w.add(p)}}}};nil
             user     system      total        real
      bulk_insert: 930.670000 861.590000 1792.260000 (2481.286568)
    • Author Contributor

      1.5m projects in 30m22s with columns specified

      This approach doesn't use the default values, so it might mean less data being sent to Postgres. The gem just generates INSERT statements in sets of 500 rows, so we could probably do better by increasing the set size (see the sketch after the benchmark below).

      [97] pry(main)> Benchmark.bm do |b|
        b.report('bulk_insert:') do
          Project.bulk_insert(:name, :path, :creator_id, :namespace_id){|w| 1500000.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id}}.each{|p| w.add(p)}}
        end
      end
             user     system      total        real
      bulk_insert: 502.490000 749.220000 1251.710000 (1822.020671)
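
      A sketch of that idea, assuming the gem's set_size option can be combined with an explicit column list (untested):

      Project.bulk_insert(:name, :path, :creator_id, :namespace_id, set_size: 5000) do |w|
        # Same rows as above, but sent to Postgres 5000 per INSERT instead of 500
        1500000.times do
          n = FFaker::Product.product
          w.add(name: n, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id)
        end
      end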

      1.5m projects in 4m23s using generate_series

      Much better performance this way, and it could be extended to use default values, similar to https://github.com/jamis/bulk_insert/blob/master/lib/bulk_insert/worker.rb#L38-L44

      [136] pry(main)> Benchmark.bm do |b|
        b.report('raw_insert:') do  
          query = <<-ENDSQL    
          INSERT INTO projects (name, path, creator_id, namespace_id)
          SELECT
            md5(random()::text),
            md5(random()::text),
            #{author.id},
            #{author.namespace_id}
          FROM generate_series(1, 1500000) s(i);
          ENDSQL
          puts ActiveRecord::Base.connection.execute(query).values.join("\n");nil    
        end    
      end  
             user     system      total        real
      raw_insert:   0.000000   0.010000   0.010000 (263.568029)
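
      Inlining the defaults by hand might look something like this (a sketch; only the timestamp columns are shown, and the real table has more defaulted columns):

      query = <<-ENDSQL
      INSERT INTO projects (name, path, creator_id, namespace_id, created_at, updated_at)
      SELECT
        md5(random()::text),
        md5(random()::text),
        #{author.id},
        #{author.namespace_id},
        now(),
        now()
      FROM generate_series(1, 1500000) s(i);
      ENDSQL
      ActiveRecord::Base.connection.execute(query)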
    • Author Contributor

      1.5m projects in 5m33s using generate_series and defaults

      Using the active_record-pg_generate_series gem to include default values for each column. Without the default values the UI will be broken, as values like created_at are used throughout.

      [6] pry(main)> Benchmark.bm do |b|
        b.report('generate_series:') do
          Project.insert_using_generate_series(1, 1500000) do |sql|
            sql.name = raw("md5(random()::text)")
            sql.path = raw("md5(random()::text)")
            sql.creator_id = author.id
            sql.namespace_id = author.namespace_id
          end
        end
      end
             user     system      total        real
      generate_series:  0.000000   0.000000   0.000000 (332.865800)

      1.5m projects in 2m7s by replacing md5 with number sequence

      This version saves time by naming projects seed_project_1 through seed_project_1500000 instead of calculating multiple md5 hashes. It still doesn't populate associated tables like project_features, so the projects aren't very useful, but it's a good starting point for slowing things down (see the sketch after the benchmark below).

      [40] pry(main)> Benchmark.bm do |b|
        b.report('generate_series:') do
          Project.insert_using_generate_series(1, 1500000) do |sql|
            project_name = raw("'seed_project_' || seq")
            sql.name = project_name
            sql.path = project_name
            sql.creator_id = author.id
            sql.namespace_id = author.namespace_id
          end
        end
      end
             user     system      total        real
      generate_series:  0.000000   0.000000   0.000000 (127.377769)
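
      A hypothetical follow-up for the project_features gap, assuming the access-level columns have database defaults (if not, they would need to be listed too):

      query = <<-ENDSQL
      INSERT INTO project_features (project_id, created_at, updated_at)
      SELECT id, now(), now()
      FROM projects
      WHERE name LIKE 'seed_project_%';
      ENDSQL
      ActiveRecord::Base.connection.execute(query)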
      Edited by Rémy Coutable
    • Contributor

      Rake task used for testing Related Issues feature:

      desc "Bulk insert"
      task bulk_insert: :environment do
        # require 'bulk_insert' or manually paste code from Gem first
        # After doing the following my queries ran at 1/10th of the speed of staging, instead of 1/5000th of the speed.
      
        # Disable database insertion logs so speed isn't limited by ability to print to console
        old_logger = ActiveRecord::Base.logger
        ActiveRecord::Base.logger = nil
      
        # Create lots of issues
        user = FactoryGirl.create(:user, username: "user-#{rand}", email: "foo-#{rand}@example.com")
        namespace = FactoryGirl.create(:namespace, owner: user, route: nil, name: "hi-#{rand}")
        p = FactoryGirl.create(:empty_project, creator: user, namespace: namespace)
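        # NB: `[attrs] * 2500000` below repeats a single attributes hash, so every
        # seeded issue shares the same FFaker title; fine for load testing.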
        Issue.bulk_insert{|w|([project_id: p.id, author_id: user.id, title: FFaker::Lorem.sentence]* 2500000).each{|i| w.add(i)}};nil
      
        # Create lots of projects
        Project.bulk_insert{|w| 1500000.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize("_"), creator_id: user.id, namespace_id: user.namespace_id}}.each{|p| w.add(p)}}
      
        # Force a different/slower query plan by updating project visibility
        Project.where(visibility_level: Gitlab::VisibilityLevel::PRIVATE).limit(200000).update_all(visibility_level: Gitlab::VisibilityLevel::PUBLIC)
        Project.where(visibility_level: Gitlab::VisibilityLevel::PRIVATE).limit(20000).update_all(visibility_level: Gitlab::VisibilityLevel::INTERNAL)
      
        IssueLink.bulk_insert do |w|
          # Link each of the first 100k issues to the six issues after it
          first_issues = Issue.all.take(100_000)
          first_issues.each_with_index do |issue, i|
            relations = ((i+1)..(i+6)).to_a.map do |x|
              target_issue = first_issues[x]
      
              if target_issue
                { source_id: issue.id, target_id: target_issue.id }
              end
            end.compact
      
            relations.each { |link| w.add(link) }
          end
        end
      
        # Reset logging
        ActiveRecord::Base.logger = old_logger
      end
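
      To confirm the visibility updates actually change the planner's behaviour, the chosen plan can be printed from the console (hypothetical example relation; any relation's to_sql works here):

      relation = Project.where(visibility_level: Gitlab::VisibilityLevel::PUBLIC)
      # EXPLAIN ANALYZE executes the query and reports the plan Postgres picked
      puts ActiveRecord::Base.connection.execute("EXPLAIN ANALYZE #{relation.to_sql}").values.join("\n")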
    • Contributor

      Great to know @rymai, thanks!

    • Maintainer

      This info was very useful to me in working on https://gitlab.com/gitlab-org/gitlab-ee/issues/4645. Thanks @jamedjo @rymai @oswaldo!

    • Maintainer

      This is pretty much what I did to benchmark Geo selective sync queries (it's not pretty, but there you go), hacking together various pieces from the comments above:

      require 'bulk_insert'
      
      # Disable database insertion logs so speed isn't limited by ability to print to console
      old_logger = ActiveRecord::Base.logger
      ActiveRecord::Base.logger = nil
      
      def bigrand; rand(999999999999); end
      
      # Create 500 root namespaces
      num_root_namespaces = 500
      num_root_namespaces.times { path="group#{bigrand}";Group.create!(name: path, path: path) }
      
      # Create 1000 descendant namespaces (randomly distributed)
      num_descendant_namespaces = 1000
      last_namespace_id = Namespace.last.id
      num_descendant_namespaces.times { path="group#{bigrand}";Group.create!(name: path, path: path, parent_id: last_namespace_id-num_root_namespaces+1+rand(num_root_namespaces)) }
      
      # Create 5000000 projects (randomly distributed)
      num_random_projects = 5000000
      total_namespaces = num_root_namespaces + num_descendant_namespaces
      last_namespace_id = Namespace.last.id
      Project.bulk_insert{|w| num_random_projects.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize(separator: "_"), namespace_id: last_namespace_id-total_namespaces+1+rand(total_namespaces)}}.each{|p| w.add(p)}}
      
      # Create 1000000 projects in the last namespace
      num_last_namespace_projects = 1000000
      last_namespace_id = Namespace.last.id
      Project.bulk_insert{|w| num_last_namespace_projects.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize(separator: "_"), namespace_id: last_namespace_id}}.each{|p| w.add(p)}}
      
      # On secondary
      
      # Create registry for last 3000000 projects
      num_new_projects = 3000000
      now = Time.now
      last_project_id = Project.last.id
      insert_attrs = ((last_project_id-num_new_projects+1)..(last_project_id)).map{|id|{created_at:now, project_id: id}}; puts('finished making insert_attrs'); Geo::ProjectRegistry.bulk_insert{|w| insert_attrs.each_with_index{|p,i| puts(i) if i%10000==0; w.add(p)}}
      
      f = Geo::ProjectRegistryFinder.new(current_node: Gitlab::Geo.current_node)
      f.count_synced_repositories
      
      ActiveRecord::Base.logger = old_logger
      
      
      
      Benchmark.measure { f.count_synced_repositories }
      Benchmark.measure { f.count_synced_wikis }
      Benchmark.measure { f.count_failed_repositories }
      Benchmark.measure { f.count_failed_wikis }
      Benchmark.measure { f.count_verified_repositories }
      Benchmark.measure { f.count_verified_wikis }
      Benchmark.measure { f.count_verification_failed_repositories }
      Benchmark.measure { f.count_verification_failed_wikis }
      Benchmark.measure { f.find_registries_to_verify(batch_size: 1000);1 }
      Benchmark.measure { f.find_projects_updated_recently(batch_size: 1000);1 }
      Benchmark.measure { f.find_unsynced_projects(batch_size: 1000);1 }
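
      Building three million attribute hashes in Ruby is the slow part of the registry insert above; a generate_series variant along the lines of the earlier comments should be much faster. A hypothetical sketch, assuming the last num_new_projects project ids are contiguous and that project_registry is the table behind Geo::ProjectRegistry (which lives on the Geo tracking database, hence the model's own connection):

      query = <<-ENDSQL
      INSERT INTO project_registry (project_id, created_at)
      SELECT i, now()
      FROM generate_series(#{last_project_id - num_new_projects + 1}, #{last_project_id}) s(i);
      ENDSQL
      Geo::ProjectRegistry.connection.execute(query)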

      Edit: I didn't include how to add the gem, e.g. to an Omnibus install. @toon needed this:

      /opt/gitlab/embedded/bin/gem install ffaker bulk_insert
      /opt/gitlab/embedded/bin/gem which ffaker bulk_insert
      # /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/ffaker-2.10.0/lib/ffaker.rb
      # /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/bulk_insert-1.7.0/lib/bulk_insert.rb
      $LOAD_PATH.push('/opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/ffaker-2.10.0/lib/')
      $LOAD_PATH.push('/opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/bulk_insert-1.7.0/lib/')

      And then in the rails console:

      require 'bulk_insert'
      require 'ffaker'
      Edited by Michael Kozono
    • Maintainer

      Cross-linking another example: https://gitlab.com/snippets/1827527
