Benchmark effects of using pipeline over multi-key operations for Redis Cluster compatibility

In #1992 (closed), we identified a list of components in the GitLab Rails application that perform multi-key operations across many hash slots.

| Class | Operation | Rails.cache | Needs pipeline changes | URL |
|-------|-----------|-------------|------------------------|-----|
| `ReactiveCacheSetCache` | `unlink` | No | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/reactive_cache_set_cache.rb#L20 |
| `AvatarCache` | `unlink` | No | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/avatar_cache.rb#L68 |
| `Cache::Helper` | `mget` | Yes | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/cache/helpers.rb#L112 |
| `Gitlab::DiscussionsDiff::HighlightCache` | `set`, `del`, `mget` | No | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/discussions_diff/highlight_cache.rb |
| `Gitlab::SetCache` | `unlink` | No | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/set_cache.rb#L25 |
| `Projects::BatchCountService` | `mget` | Yes | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/projects/batch_count_service.rb#L17 |
| `Gitlab::MarkdownCache::Redis::Store` | `hmget` | No | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/markdown_cache/redis/store.rb |
| `Gitlab::Cache::Import::Caching` | `set` | No | No | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/cache/import/caching.rb |
| `Gitlab::Pages::CacheControl` | `del` | Yes | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pages/cache_control.rb |
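For context on why these operations need changes: Redis Cluster rejects any multi-key command whose keys do not all map to the same hash slot (a `CROSSSLOT` error). The slot is the CRC16 (XMODEM variant) of the key modulo 16384, honouring `{hash tag}` substrings. A minimal sketch of that computation (the helper names here are ours, not from the GitLab codebase):

```ruby
# Hash-slot computation per the Redis Cluster specification:
# CRC16-XMODEM of the key, modulo 16384. If the key contains a
# {hash tag}, only the tag's contents are hashed.
def crc16_xmodem(str)
  crc = 0
  str.each_byte do |byte|
    crc ^= byte << 8
    8.times do
      crc = (crc & 0x8000).zero? ? (crc << 1) : ((crc << 1) ^ 0x1021)
      crc &= 0xFFFF
    end
  end
  crc
end

def hash_slot(key)
  open_brace = key.index('{')
  if open_brace
    close_brace = key.index('}', open_brace + 1)
    # Use the non-empty tag between the first '{' and the next '}'.
    key = key[(open_brace + 1)...close_brace] if close_brace && close_brace > open_brace + 1
  end
  crc16_xmodem(key) % 16_384
end

# Keys without a shared hash tag generally land in different slots, so a
# single UNLINK/MGET across them fails in cluster mode. Keys sharing a
# tag are guaranteed to colocate:
hash_slot('{avatar_cache}:a') == hash_slot('{avatar_cache}:b') # => true
```

Only keys sharing a `{hash tag}` are guaranteed to land in the same slot, which is why the callers above currently need `allow_cross_slot_commands` and are candidates for pipelines of single-key commands.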
How the list was obtained:

1. Look for instances of `allow_cross_slot_commands`:

   ```shell
   ➜  gitlab git:(master) rg allow_cross_slot_commands -l
   lib/gitlab/etag_caching/store.rb
   lib/gitlab/pages/cache_control.rb
   lib/gitlab/issues/rebalancing/state.rb
   lib/gitlab/reactive_cache_set_cache.rb
   lib/gitlab/set_cache.rb
   lib/gitlab/repository_cache/preloader.rb
   lib/gitlab/manifest_import/metadata.rb
   lib/gitlab/markdown_cache/redis/store.rb
   lib/gitlab/instrumentation/redis_cluster_validator.rb
   lib/gitlab/repository_hash_cache.rb
   lib/gitlab/discussions_diff/highlight_cache.rb
   lib/gitlab/cache/import/caching.rb
   lib/gitlab/cache/helpers.rb
   lib/gitlab/avatar_cache.rb
   lib/tasks/cache.rake
   ee/app/services/elastic/indexing_control_service.rb
   ee/app/services/elastic/process_bookkeeping_service.rb
   app/services/projects/batch_count_service.rb
   app/models/active_session.rb
   app/models/ci/build_trace_chunks/redis_base.rb
   ```

2. Separate the matches into Redis instance types:
## Repository Cache

- [ ]  lib/gitlab/repository_hash_cache.rb
- [ ]  lib/gitlab/repository_cache/preloader.rb

## Cache

- [ ]  app/services/projects/batch_count_service.rb (rails cache)
- [ ]  lib/gitlab/avatar_cache.rb
- [ ]  lib/gitlab/cache/helpers.rb (rails cache)
- [ ]  lib/gitlab/cache/import/caching.rb
- [ ]  lib/gitlab/discussions_diff/highlight_cache.rb
- [ ]  lib/gitlab/markdown_cache/redis/store.rb
- [ ]  lib/gitlab/set_cache.rb
- [ ]  lib/gitlab/reactive_cache_set_cache.rb
- [ ]  lib/gitlab/pages/cache_control.rb (rails cache)

## Shared state

- [ ]  app/models/ci/build_trace_chunks/redis_base.rb (via build_trace_chunks/redis.rb)
- [ ]  ee/app/services/elastic/process_bookkeeping_service.rb
- [ ]  ee/app/services/elastic/indexing_control_service.rb
- [ ]  lib/gitlab/manifest_import/metadata.rb
- [ ]  lib/gitlab/issues/rebalancing/state.rb
- [ ]  lib/gitlab/etag_caching/store.rb

## Sessions

- [ ]  app/models/active_session.rb

Before we can roll out the cross-slot pipeline functionality, we may need to understand the full effects of switching to a pipeline. While pipelines reduce network round-trips between the Rails client and the Redis server, they add server-side overheads such as parsing each individual command and queueing replies in memory.
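The trade-off can be made concrete with a toy model (the `FakeRedis` class below is invented for illustration and is not the real client API): replacing one N-key command with a pipeline keeps the round-trip count at one, but the server now parses and answers N commands instead of one.

```ruby
# Toy client that counts round trips and server-side commands, to
# illustrate why pipelining preserves latency but not server work.
class FakeRedis
  attr_reader :round_trips, :commands

  def initialize
    @round_trips = 0
    @commands = 0
  end

  # A multi-key command: one round trip carrying all keys, parsed once.
  def unlink(*keys)
    @round_trips += 1
    @commands += 1
    keys.size
  end

  # A pipeline: buffered commands are flushed in a single round trip,
  # but the server still parses and replies to each one individually.
  def pipelined
    buffer = []
    yield buffer
    @round_trips += 1
    @commands += buffer.size
    buffer.size
  end
end

keys = (1..100).map { |i| "cache:key:#{i}" }

multi = FakeRedis.new
multi.unlink(*keys)   # 1 round trip, 1 command parsed server-side

pipe = FakeRedis.new
pipe.pipelined { |p| keys.each { |k| p << [:unlink, k] } }
# 1 round trip, but 100 commands parsed server-side
```

This is exactly the server-side cost (CPU for parsing, memory for reply queueing) that the benchmark below is designed to measure.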

## Proposal

Use feature flags to compare client-side and server-side differences between the current multi-key commands and pipelines.

We should collect ~1 week of data to compare week-on-week. We may need to capture perf traces before and after (this will require SRE assistance). This is also a good chance to add our Rails.cache pipeline patch separately instead of cramming it into one large MR.

In summary, this should require two application-side MRs:

1. Add a feature-flag-gated alternative for non-Rails.cache multi-key commands such as `unlink`, `del`, and `mget`
2. Add a feature-flag-gated patch for `Rails.cache`'s `read_multi` and `delete_multi`
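A sketch of what the MR 1 switch could look like (the flag name `:use_pipeline_over_multikey` is hypothetical, and `Feature`/`RecordingRedis` below are minimal stand-ins for GitLab's Feature API and the Redis client, so the example is self-contained):

```ruby
# Stand-in for GitLab's Feature API; the real flag is rolled out
# gradually (e.g. percentage of actors) via ChatOps.
module Feature
  def self.enabled?(_flag)
    true
  end
end

# Stand-in Redis client that records the commands it receives.
class RecordingRedis
  attr_reader :calls

  def initialize
    @calls = []
  end

  def unlink(*keys)
    @calls << [:unlink, *keys]
  end

  def pipelined
    yield self
  end
end

def delete_keys(redis, keys)
  if Feature.enabled?(:use_pipeline_over_multikey) # hypothetical flag name
    # Pipeline of single-key UNLINKs: each command touches one slot,
    # so it is Redis Cluster compatible without cross-slot allowances.
    redis.pipelined { |p| keys.each { |k| p.unlink(k) } }
  else
    # Current behaviour: one cross-slot multi-key UNLINK.
    redis.unlink(*keys)
  end
end

redis = RecordingRedis.new
delete_keys(redis, %w[key:a key:b key:c])
redis.calls # => [[:unlink, "key:a"], [:unlink, "key:b"], [:unlink, "key:c"]]
```

Gating both code paths behind one flag keeps the comparison clean: the same callers run either branch depending on rollout percentage, so week-on-week dashboards measure only the pipeline change.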

If results from MR1 is positive, we can proceed to MR2 and invest effort into writing the patches.

## Carrying out the benchmark/experiment

We will control the rollout using a feature flag. Specific details can be kept in the feature-flag rollout issues (e.g. gitlab-org/gitlab#409436 (closed)) so the process is operationally traceable (ChatOps feature-flag command -> feature-flag config -> rollout issue).

In general, we are looking out for:

1. Client-side apdex
2. Server-side `redis_primary_cpu` saturation ratio
3. Server-side `redis_memory_cache` / `memory_redis_cache` saturation

Pipelining can be deemed a valid approach if the deviation in all three metrics stays within acceptable levels, where "acceptable" is defined as:

1. Does not trigger any existing alerts
2. Does not bring forward Tamland's forecasted soft-threshold violation date

## Results

### Benchmark 1

Summarised in the feature-flag issue: gitlab-org/gitlab#409436 (comment 1389685778). In brief, CPU and memory on redis-cache worsened ever so slightly, and client-side apdex had sharper dips compared with the offset (1 week ago) plot.

Conclusion: This performance degradation is within expectations and is not a serious concern.

Update: The feature flag was disabled due to https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23699. We can re-enable it after the corrective action is performed.

### Benchmark 2

The benchmark started on 5 June 2023. As of 6 June 2023, the feature flag is enabled for 50% of random actors. See gitlab-org/gitlab#410115 (comment 1421075381) for more details.

On 7 June, the feature flag was disabled because primary CPU peaked at 85% (~5 percentage points higher than the past 4 weeks) with the flag enabled at 50%. While that is still safe, there is little benefit in pushing this to 75-100% and risking saturation for the sake of benchmarking.

Given the increase at 50%, our intended Redis Cluster setup (see sizing details) will be able to handle it, since we are using 5 shards with ~33% more vCPU (80 vCPU vs the current 60 vCPU). That means our redis-cluster-cache will have ~33% more vCPU than redis-cache.

Edited by Sylvester Chin