Benchmark effects of using pipeline over multi-key operations for Redis Cluster compatibility

In #1992 (closed), we identified a list of components in the GitLab Rails application that perform multi-key operations across many hash slots.

| Class | Operation | Rails.cache | Needs pipeline changes | URL |
|-------|-----------|-------------|------------------------|-----|
| `ReactiveCacheSetCache` | `unlink` | No | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/reactive_cache_set_cache.rb#L20 |
| `AvatarCache` | `unlink` | No | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/avatar_cache.rb#L68 |
| `Cache::Helper` | `mget` | Yes | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/cache/helpers.rb#L112 |
| `Gitlab::DiscussionsDiff::HighlightCache` | `set`, `del`, `mget` | No | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/discussions_diff/highlight_cache.rb |
| `Gitlab::SetCache` | `unlink` | No | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/set_cache.rb#L25 |
| `Projects::BatchCountService` | `mget` | Yes | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/projects/batch_count_service.rb#L17 |
| `Gitlab::MarkdownCache::Redis::Store` | `hmget` | No | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/markdown_cache/redis/store.rb |
| `Gitlab::Cache::Import::Caching` | `set` | No | No | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/cache/import/caching.rb |
| `Gitlab::Pages::CacheControl` | `del` | Yes | Yes | https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pages/cache_control.rb |
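For context on why these operations need changes: Redis Cluster rejects any multi-key command whose keys do not all map to the same hash slot (a `CROSSSLOT` error). The slot is the CRC16 (XMODEM variant) of the key modulo 16384, honouring `{hash tag}` substrings. A minimal sketch of that computation (the helper names here are ours, not from the GitLab codebase):

```ruby
# Hash-slot computation per the Redis Cluster specification:
# CRC16-XMODEM of the key, modulo 16384. If the key contains a
# {hash tag}, only the tag's contents are hashed.
def crc16_xmodem(str)
  crc = 0
  str.each_byte do |byte|
    crc ^= byte << 8
    8.times do
      crc = (crc & 0x8000).zero? ? (crc << 1) : ((crc << 1) ^ 0x1021)
      crc &= 0xFFFF
    end
  end
  crc
end

def hash_slot(key)
  open_brace = key.index('{')
  if open_brace
    close_brace = key.index('}', open_brace + 1)
    # Use the non-empty tag between the first '{' and the next '}'.
    key = key[(open_brace + 1)...close_brace] if close_brace && close_brace > open_brace + 1
  end
  crc16_xmodem(key) % 16_384
end

# Keys without a shared hash tag generally land in different slots, so a
# single UNLINK/MGET across them fails in cluster mode. Keys sharing a
# tag are guaranteed to colocate:
hash_slot('{avatar_cache}:a') == hash_slot('{avatar_cache}:b') # => true
```

Only keys sharing a `{hash tag}` are guaranteed to land in the same slot, which is why the callers above currently need `allow_cross_slot_commands` and are candidates for pipelines of single-key commands.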
How the list was obtained:

1. Look for instances of `allow_cross_slot_commands`:

   ```shell
   ➜  gitlab git:(master) rg allow_cross_slot_commands -l
   lib/gitlab/etag_caching/store.rb
   lib/gitlab/pages/cache_control.rb
   lib/gitlab/issues/rebalancing/state.rb
   lib/gitlab/reactive_cache_set_cache.rb
   lib/gitlab/set_cache.rb
   lib/gitlab/repository_cache/preloader.rb
   lib/gitlab/manifest_import/metadata.rb
   lib/gitlab/markdown_cache/redis/store.rb
   lib/gitlab/instrumentation/redis_cluster_validator.rb
   lib/gitlab/repository_hash_cache.rb
   lib/gitlab/discussions_diff/highlight_cache.rb
   lib/gitlab/cache/import/caching.rb
   lib/gitlab/cache/helpers.rb
   lib/gitlab/avatar_cache.rb
   lib/tasks/cache.rake
   ee/app/services/elastic/indexing_control_service.rb
   ee/app/services/elastic/process_bookkeeping_service.rb
   app/services/projects/batch_count_service.rb
   app/models/active_session.rb
   app/models/ci/build_trace_chunks/redis_base.rb
   ```

2. Separate the matches into Redis instance types:
## Repository Cache

- [ ]  lib/gitlab/repository_hash_cache.rb
- [ ]  lib/gitlab/repository_cache/preloader.rb

## Cache

- [ ]  app/services/projects/batch_count_service.rb (rails cache)
- [ ]  lib/gitlab/avatar_cache.rb
- [ ]  lib/gitlab/cache/helpers.rb (rails cache)
- [ ]  lib/gitlab/cache/import/caching.rb
- [ ]  lib/gitlab/discussions_diff/highlight_cache.rb
- [ ]  lib/gitlab/markdown_cache/redis/store.rb
- [ ]  lib/gitlab/set_cache.rb
- [ ]  lib/gitlab/reactive_cache_set_cache.rb
- [ ]  lib/gitlab/pages/cache_control.rb (rails cache)

## Shared state

- [ ]  app/models/ci/build_trace_chunks/redis_base.rb (via build_trace_chunks/redis.rb)
- [ ]  ee/app/services/elastic/process_bookkeeping_service.rb
- [ ]  ee/app/services/elastic/indexing_control_service.rb
- [ ]  lib/gitlab/manifest_import/metadata.rb
- [ ]  lib/gitlab/issues/rebalancing/state.rb
- [ ]  lib/gitlab/etag_caching/store.rb

## Sessions

- [ ]  app/models/active_session.rb

Before we can roll out the cross-slot pipeline functionality, we may need to understand the full effects of switching to a pipeline. While pipelines reduce network round-trips between the Rails client and the Redis server, they add server-side overheads such as parsing each individual command and queueing replies in memory.
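The trade-off can be made concrete with a toy model (the `FakeRedis` class below is invented for illustration and is not the real client API): replacing one N-key command with a pipeline keeps the round-trip count at one, but the server now parses and answers N commands instead of one.

```ruby
# Toy client that counts round trips and server-side commands, to
# illustrate why pipelining preserves latency but not server work.
class FakeRedis
  attr_reader :round_trips, :commands

  def initialize
    @round_trips = 0
    @commands = 0
  end

  # A multi-key command: one round trip carrying all keys, parsed once.
  def unlink(*keys)
    @round_trips += 1
    @commands += 1
    keys.size
  end

  # A pipeline: buffered commands are flushed in a single round trip,
  # but the server still parses and replies to each one individually.
  def pipelined
    buffer = []
    yield buffer
    @round_trips += 1
    @commands += buffer.size
    buffer.size
  end
end

keys = (1..100).map { |i| "cache:key:#{i}" }

multi = FakeRedis.new
multi.unlink(*keys)   # 1 round trip, 1 command parsed server-side

pipe = FakeRedis.new
pipe.pipelined { |p| keys.each { |k| p << [:unlink, k] } }
# 1 round trip, but 100 commands parsed server-side
```

This is exactly the server-side cost (CPU for parsing, memory for reply queueing) that the benchmark below is designed to measure.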

## Proposal

Use feature flags to compare client-side and server-side differences between the current multi-key commands and pipelines.

We should collect ~1 week of data to compare week-on-week. We may need to capture perf traces before and after (this will require SRE assistance). This is also a good chance to add our Rails.cache pipeline patch separately instead of cramming it into one large MR.

In summary, this should require two application-side MRs:

1. Add a feature-flag-gated alternative for non-Rails.cache multi-key commands such as `unlink`, `del`, and `mget`
2. Add a feature-flag-gated patch for `Rails.cache`'s `read_multi` and `delete_multi`
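A sketch of what the MR 1 switch could look like (the flag name `:use_pipeline_over_multikey` is hypothetical, and `Feature`/`RecordingRedis` below are minimal stand-ins for GitLab's Feature API and the Redis client, so the example is self-contained):

```ruby
# Stand-in for GitLab's Feature API; the real flag is rolled out
# gradually (e.g. percentage of actors) via ChatOps.
module Feature
  def self.enabled?(_flag)
    true
  end
end

# Stand-in Redis client that records the commands it receives.
class RecordingRedis
  attr_reader :calls

  def initialize
    @calls = []
  end

  def unlink(*keys)
    @calls << [:unlink, *keys]
  end

  def pipelined
    yield self
  end
end

def delete_keys(redis, keys)
  if Feature.enabled?(:use_pipeline_over_multikey) # hypothetical flag name
    # Pipeline of single-key UNLINKs: each command touches one slot,
    # so it is Redis Cluster compatible without cross-slot allowances.
    redis.pipelined { |p| keys.each { |k| p.unlink(k) } }
  else
    # Current behaviour: one cross-slot multi-key UNLINK.
    redis.unlink(*keys)
  end
end

redis = RecordingRedis.new
delete_keys(redis, %w[key:a key:b key:c])
redis.calls # => [[:unlink, "key:a"], [:unlink, "key:b"], [:unlink, "key:c"]]
```

Gating both code paths behind one flag keeps the comparison clean: the same callers run either branch depending on rollout percentage, so week-on-week dashboards measure only the pipeline change.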

If results from MR1 is positive, we can proceed to MR2 and invest effort into writing the patches.

## Carrying out the benchmark/experiment

We will control the rollout using a feature flag. Specific details can be kept in the feature-flag rollout issues (e.g. gitlab-org/gitlab#409436 (closed)) so the process is operationally traceable (ChatOps feature-flag command -> feature-flag config -> rollout issue).

In general, we are looking out for:

1. Client-side apdex
2. Server-side `redis_primary_cpu` saturation ratio
3. Server-side `redis_memory_cache` / `memory_redis_cache` saturation

Pipelining can be deemed a valid approach if the deviation in all three metrics stays within acceptable levels, where "acceptable" is defined as:

1. Does not trigger any existing alerts
2. Does not bring forward Tamland's forecasted soft-threshold violation date

## Results

### Benchmark 1

Summarised in the feature-flag issue: gitlab-org/gitlab#409436 (comment 1389685778). In brief, CPU and memory on redis-cache worsened ever so slightly, and client-side apdex had sharper dips compared with the offset (1 week ago) plot.

Conclusion: This performance degradation is within expectations and is not a serious concern.

Update: The feature flag was disabled due to https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23699. We can re-enable it after the corrective action is performed.

### Benchmark 2

The benchmark started on 5 June 2023. As of 6 June 2023, the feature flag is enabled for 50% of random actors. See gitlab-org/gitlab#410115 (comment 1421075381) for more details.

On 7 June, the feature flag was disabled because primary CPU peaked at 85% (~5 percentage points higher than the past 4 weeks) with the flag enabled at 50%. While that is still safe, there is little benefit in pushing this to 75-100% and risking saturation for the sake of benchmarking.

Given the increase at 50%, our intended Redis Cluster setup (see sizing details) will be able to handle it, since we are using 5 shards with ~33% more vCPU (80 vCPU vs the current 60 vCPU). That means our redis-cluster-cache will have ~33% more vCPU than redis-cache.

Edited by Sylvester Chin