Distributed HyperLogLog batch counter

Mikołaj Wawrzyniak requested to merge mwaw/distributed_hll_batch_counter into master

What does this MR do?

Inspired by the pure SQL HLL distinct count approach described at https://www.sisense.com/blog/hyperloglog-in-pure-sql/

This MR implements a PoC for a distributed HLL batch counter. The basic idea is as follows:

  1. Create HLL buckets at the DB level for each batch
  2. Fetch HLL buckets from the application in a loop
  3. Merge HLL buckets at the application layer
  4. Estimate the relation size from the merged HLL buckets at the application layer
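Steps 3 and 4 can be sketched in plain Ruby. This is only an illustration of the underlying math, not the PoC's actual API; bucket layout (512 buckets, each holding the maximum observed "leading zeros + 1" value) and all names here are assumptions:

```ruby
# Illustrative sketch: buckets are represented as { bucket_num => max_value }.
BUCKETS = 512

# Step 3: the union of two HLL bucket sets is their element-wise maximum.
def merge_buckets(a, b)
  a.merge(b) { |_bucket, left, right| [left, right].max }
end

# Step 4: the classic HyperLogLog estimator (small/large range corrections
# omitted for brevity).
def estimate(buckets)
  m = BUCKETS.to_f
  alpha = 0.7213 / (1 + 1.079 / m) # bias-correction constant for m >= 128
  sum = (0...BUCKETS).sum { |i| 2.0**-(buckets[i] || 0) }
  (alpha * m * m / sum).round
end
```

Because merging is a pure element-wise maximum, buckets from any number of batches (or data sources) can be folded together before a single final estimate.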

Main gains from the distributed approach

  1. A fixed upper limit on the number of records included in each analyzed batch, which ensures a constant maximal execution time for each batch of a given size (see discussion at #230438 (comment 421212723) for more detail)
  2. A constant, small amount of data fetched by the application (512 rows with 2 integer attributes)
  3. The ability to merge different HLL buckets later on (unions and intersections between different SQL metrics)
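A minimal sketch of the fetch loop behind gains 1 and 2 (the batch size, method names, and yielded shape are all illustrative assumptions, not the actual PoC API): each iteration scans a fixed id range, so a single batch's execution time is bounded regardless of relation size, and each query returns at most 512 (bucket_num, bucket_value) rows.

```ruby
BATCH_SIZE = 10_000 # illustrative; the real counter may use a different size

# Walks the relation's id range in fixed-size slices. One aggregate SQL query
# per slice would build the HLL buckets; here we yield an empty bucket hash to
# show the control flow only.
def each_batch_buckets(min_id, max_id)
  (min_id..max_id).step(BATCH_SIZE) do |start|
    finish = [start + BATCH_SIZE - 1, max_id].min
    yield(start, finish, {}) # {} stands in for { bucket_num => bucket_value }
  end
end
```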

Why this MR is required

This MR was motivated by the following:

1. (Not usage ping specific): BatchCount does not handle distinct counting on non-unique values well for big relations

I've described this in detail earlier at !45673 (comment 444843665). I originally spotted this problem in some of the usage ping metrics, but it is not limited to them and can occur anywhere BatchCount is used, when certain pre-existing conditions are met.

2. (Usage ping specific) Calculating unions and intersections of usage ping metrics

Currently we are working on reporting how many distinct entries were recorded across a selected subset of metrics (union), and how many distinct entries were present in all of the listed metrics (intersection): !46146 (merged).

Although for metrics counted in pure SQL it would be possible to use UNION and then run a distinct count on the resulting relation, we also have metrics that are recorded through the Redis HLL algorithm with https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/usage_data_counters/hll_redis_counter.rb. Because of that, it is impossible to use a SQL UNION to obtain unions and intersections across multiple data sources.

In the long run, we would need to replace the Redis HLL algorithm with an internal HLL implementation, so that HLL buckets are compatible with each other across every data source and unions and intersections can be calculated from them. This implementation is a first step towards that goal. Another advantage this implementation has over the UNION version is that, once built, HLL buckets can be reused to obtain multiple unions and intersections, while the resulting value of a UNION distinct count cannot.
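To illustrate why keeping the buckets around is more flexible than a one-shot UNION distinct count (helper names here are hypothetical, not the PoC's API): HLL buckets support union natively via an element-wise maximum, and an intersection estimate can then be derived with inclusion-exclusion.

```ruby
# Union of two bucket sets: element-wise maximum per bucket.
def merge_buckets(a, b)
  a.merge(b) { |_bucket, left, right| [left, right].max }
end

# Inclusion-exclusion on cardinality estimates: |A ∩ B| ≈ |A| + |B| - |A ∪ B|.
# HLL has no native intersection, so this is the usual workaround.
def intersection_estimate(estimate_a, estimate_b, estimate_union)
  estimate_a + estimate_b - estimate_union
end
```

The same stored bucket sets can be combined repeatedly into arbitrary unions and intersections, whereas a UNION distinct count yields a single scalar that cannot be composed further.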

Alternative approaches were evaluated at #270444 (closed)

Benchmark

I've compared the existing Gitlab::Database::BatchCount#batch_distinct_count with the newly proposed counter using the following benchmark, on a relation (Ci::Build) that contains 20 010 877 records in total, 2 570 000 of which fulfill the filter condition.

require 'benchmark/ips'

Benchmark.ips do |x|
  x.report('hll distributed batch counting') do
    Gitlab::Database::PostgresHllBatchDistinctCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
  end

  x.report('batch counting') do
    Gitlab::Database::BatchCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
  end

  x.compare!
end
Calculating -------------------------------------
hll distributed batch counting
                         1.000  i/100ms
      batch counting     1.000  i/100ms
-------------------------------------------------
hll distributed batch counting
                          0.040  (± 0.0%) i/s -      1.000  in  25.089552s
      batch counting      0.039  (± 0.0%) i/s -      1.000  in  25.643170s

Comparison:
hll distributed batch counting:        0.0 i/s
      batch counting:        0.0 i/s - 1.02x slower

=> #<Benchmark::IPS::Report:0x00007f88788e6ce8
 @data=nil,
 @entries=[#<Benchmark::IPS::Report::Entry:0x00007f886cc86698 @ips=0.03985722871678709, @ips_sd=0, @iterations=1, @label="hll distributed batch counting", @measurement_cycle=1, @microseconds=25089551.6872406, @show_total_time=true>, #<Benchmark::IPS::Report::Entry:0x00007f887b247ab0 @ips=0.03899673888536357, @ips_sd=0, @iterations=1, @label="batch counting", @measurement_cycle=1, @microseconds=25643169.87991333, @show_total_time=true>]>

Results returned by each of the counters were as follows:

[9] pry(main)> Gitlab::Database::PostgresHllBatchDistinctCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
=> 2601007
[10] pry(main)> Gitlab::Database::BatchCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
=> 2570000

The HyperLogLog estimation was off by ~1.2%.

Accuracy estimation

| Size | HyperLogLog | Exact count | Error Rate % |
| --- | --- | --- | --- |
| 10 | 10.49 | 10 | 4.87 |
| 100 | 105.31 | 100 | 5.31 |
| 1000 | 1007.27 | 1000 | 0.73 |
| 10000 | 9689 | 10000 | -3.11 |
| 100000 | 102621 | 100000 | 2.62 |
| 1000000 | 996113 | 1000000 | -0.39 |
| 10000000 | 10070622 | 10000000 | 0.71 |

Accuracy estimation was done with the following snippet:

require './spec/support/sidekiq_middleware'

class Gitlab::Seeder::Users
  include ActionView::Helpers::NumberHelper

  MAX_BATCH_SIZE = 100_000

  attr_reader :opts

  def initialize(mass_users_count)
    @mass_users_count = mass_users_count
  end

  def seed!
    Sidekiq::Testing.inline! do
      create_mass_users!
    end
  end

  private

  def create_mass_users!
    encrypted_password = Devise::Encryptor.digest(User, random_password)

    batch_size = @mass_users_count > MAX_BATCH_SIZE ? MAX_BATCH_SIZE : @mass_users_count
    batch_start = 1
    while batch_start < @mass_users_count do
      ActiveRecord::Base.connection.execute <<~SQL
        INSERT INTO users (username, name, email, confirmed_at, projects_limit, encrypted_password)
        SELECT
          '#{Gitlab::Seeder::MASS_INSERT_USER_START}' || seq,
          'Seed user ' || seq,
          'seed_user' || seq || '@example.com',
          to_timestamp(seq),
          #{@mass_users_count},
          '#{encrypted_password}'
        FROM generate_series(#{batch_start}, #{batch_start + batch_size - 1}) AS seq
      SQL
      batch_start += batch_size
    end
  end

  def random_password
    @random_password ||= SecureRandom.hex.slice(0,16)
  end
end


vals = []
[10, 100, 1_000, 10_000,  100_000, 1_000_000, 10_000_000].each do |size|
  Gitlab::Seeder::Users.new(size).seed!
  relation = User
  hll_est = Gitlab::Database::PostgresHllBatchDistinctCount.batch_distinct_count(relation, :email)
  exact_count = size
  vals << [size, hll_est, exact_count]
  ActiveRecord::Base.connection.execute('TRUNCATE users CASCADE')
end

template = "| %{size} | %{hll} | %{exact} | %{err} | \n"

print template % { size: 'Size', hll: 'HyperLogLog', exact: 'Exact count', err: 'Error Rate' }

vals.each do |size, hll_est, exact_count|
  print template % { size: size, hll: hll_est, exact: exact_count, err: (hll_est / exact_count.to_f) * 100 - 100 }
end

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team