Investigation - Usage Ping returns -1 for some Redis HLL counters

Opening this issue to document a Usage Ping bug.

Problem description

Incorrect weekly keys for Redis HLL counters

Technical insight

  def weekly_redis_keys(events:, start_date:, end_date:, context: '')
    weeks = end_date.to_date.cweek - start_date.to_date.cweek
    weeks = 1 if weeks == 0

    (0..(weeks - 1)).map do |week_increment|
      events.map { |event| redis_key(event, start_date + week_increment * 7.days, context) }
    end.flatten
  end
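  • When the date range crosses a year boundary, end_date's cweek is smaller than start_date's, so weeks is negative, the range (0..(weeks - 1)) is empty, and the weekly_redis_keys method returns an empty array []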
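
The week arithmetic can be reproduced in plain Ruby, without Rails or Redis. The sketch below is only an illustration: week_increments copies the cweek logic from weekly_redis_keys, while week_increments_by_days is a hypothetical alternative (not necessarily the fix the project should adopt) that counts weeks from the day difference. Both method names and the sample dates are assumptions made for this example.

```ruby
require 'date'

# Same week arithmetic as weekly_redis_keys, isolated from Redis.
def week_increments(start_date, end_date)
  weeks = end_date.cweek - start_date.cweek
  weeks = 1 if weeks == 0
  (0..(weeks - 1)).to_a
end

# Hypothetical alternative: derive the week count from the number of days,
# which does not reset at the start of a new year.
def week_increments_by_days(start_date, end_date)
  weeks = ((end_date - start_date).to_i / 7.0).ceil
  weeks = 1 if weeks < 1
  (0..(weeks - 1)).to_a
end

# Four-week range inside one year: cweek 45 to cweek 49.
within_year = week_increments(Date.new(2020, 11, 2), Date.new(2020, 11, 30))

# Four-week range crossing the year boundary: cweek 51 to cweek 2.
across_year = week_increments(Date.new(2020, 12, 14), Date.new(2021, 1, 11))

# The day-based count gives four increments for the same range.
across_year_by_days = week_increments_by_days(Date.new(2020, 12, 14), Date.new(2021, 1, 11))

puts within_year.inspect
puts across_year.inspect
puts across_year_by_days.inspect
```

With no week increments, no Redis keys are built, which matches the empty array described above.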

When we fetch the data, we wrap the call in the hardening method for Redis:

# https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/usage_data_counters/hll_redis_counter.rb#L237

redis_usage_data { Gitlab::Redis::HLL.count(keys: []) }

# The hardening method rescues two types of errors: https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/utils/usage_data.rb#L155

def redis_usage_counter
  yield
rescue ::Redis::CommandError, Gitlab::UsageDataCounters::BaseCounter::UnknownEvent
  FALLBACK
end

Checking how this logic behaves in the test, development and production environments, we noticed that the behaviour differs.

https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/instrumentation/redis_cluster_validator.rb#L50

This logic behaves inconsistently when throwing errors. For example:

Gitlab::Redis::HLL.count(keys: [])

# In development we would see 

Gitlab::Instrumentation::RedisClusterValidator::CrossSlotError: Redis command PFCOUNT arguments hash to different slots. See https://docs.gitlab.com/ee/development/redis.html#multi-key-commands

# While in production we would see

Redis::CommandError (ERR wrong number of arguments for 'pfcount' command)

Summary

My first assumption was that Usage Ping would fail in every environment.

Usage Ping fails in test and development because we do not rescue the Gitlab::Instrumentation::RedisClusterValidator::CrossSlotError exception. It does not fail in other environments because we rescue Redis::CommandError and return -1.
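
The difference can be sketched with stand-in classes, so that it runs without GitLab or Redis. Everything inside the Sketch module is hypothetical and only mirrors the shape of redis_usage_counter quoted earlier; the real exception classes may have a different ancestry, and FALLBACK is set to -1 because that is the value Usage Ping reports.

```ruby
# Stand-ins for the real classes; names and hierarchy are assumptions.
module Sketch
  FALLBACK = -1

  class CommandError < StandardError; end   # stands in for ::Redis::CommandError
  class UnknownEvent < StandardError; end   # stands in for BaseCounter::UnknownEvent
  class CrossSlotError < StandardError; end # stands in for RedisClusterValidator::CrossSlotError

  # Same shape as the hardening method: only two error classes are rescued.
  def self.redis_usage_counter
    yield
  rescue CommandError, UnknownEvent
    FALLBACK
  end
end

# Production-like path: the Redis error is rescued and the fallback is returned.
production_result = Sketch.redis_usage_counter do
  raise Sketch::CommandError, "ERR wrong number of arguments for 'pfcount' command"
end

# Development-like path: the validator error is not in the rescue list,
# so it propagates out of the hardening method.
development_result =
  begin
    Sketch.redis_usage_counter do
      raise Sketch::CrossSlotError, 'Redis command PFCOUNT arguments hash to different slots'
    end
  rescue Sketch::CrossSlotError => e
    "unhandled: #{e.class}"
  end

puts production_result
puts development_result
```

In this sketch, adding the validator error to the rescue list would make both paths return the fallback, but it would also hide the validator's warning in development, so whether that is desirable is an open design question.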

Questions

  • Why do we have different behaviours for development and testing?
  • Could we improve anything in this area to help us have the same behaviour?

Monitor weekly Usage Ping generation

I have a GitLab 13.6.0 installation where I plan to run tests for Usage Ping.

Data affected

The affected weeks would be weeks 1, 2, 3 and 4 from the beginning of the year.

Redis HLL monthly counters will return -1

(Looking to get a full list of metric names)

Monitoring table in Periscope

https://app.periscopedata.com/app/gitlab/484367/WIP:-Mathieu-Peychet's-scratch?widget=10655411&udv=760765

Edited by Alina Mihaila