Benchmark effects of using pipeline over multi-key operations for Redis Cluster compatibility
In #1992 (closed), we identified a list of components in the GitLab Rails application that perform multi-key Redis operations across multiple hash slots.
<details>
<summary>Click to show details on how the list was obtained</summary>

Look for instances of `allow_cross_slot_commands`:

```shell
➜ gitlab git:(master) rg allow_cross_slot_commands -l
lib/gitlab/etag_caching/store.rb
lib/gitlab/pages/cache_control.rb
lib/gitlab/issues/rebalancing/state.rb
lib/gitlab/reactive_cache_set_cache.rb
lib/gitlab/set_cache.rb
lib/gitlab/repository_cache/preloader.rb
lib/gitlab/manifest_import/metadata.rb
lib/gitlab/markdown_cache/redis/store.rb
lib/gitlab/instrumentation/redis_cluster_validator.rb
lib/gitlab/repository_hash_cache.rb
lib/gitlab/discussions_diff/highlight_cache.rb
lib/gitlab/cache/import/caching.rb
lib/gitlab/cache/helpers.rb
lib/gitlab/avatar_cache.rb
lib/tasks/cache.rake
ee/app/services/elastic/indexing_control_service.rb
ee/app/services/elastic/process_bookkeeping_service.rb
app/services/projects/batch_count_service.rb
app/models/active_session.rb
app/models/ci/build_trace_chunks/redis_base.rb
```

Then separate the files by Redis instance type, as listed below.

</details>
## Repository Cache
- [ ] lib/gitlab/repository_hash_cache.rb
- [ ] lib/gitlab/repository_cache/preloader.rb
## Cache
- [ ] app/services/projects/batch_count_service.rb (rails cache)
- [ ] lib/gitlab/avatar_cache.rb
- [ ] lib/gitlab/cache/helpers.rb (rails cache)
- [ ] lib/gitlab/cache/import/caching.rb
- [ ] lib/gitlab/discussions_diff/highlight_cache.rb
- [ ] lib/gitlab/markdown_cache/redis/store.rb
- [ ] lib/gitlab/set_cache.rb
- [ ] lib/gitlab/reactive_cache_set_cache.rb
- [ ] lib/gitlab/pages/cache_control.rb (rails cache)
## Shared state
- [ ] app/models/ci/build_trace_chunks/redis_base.rb (via build_trace_chunks/redis.rb)
- [ ] ee/app/services/elastic/process_bookkeeping_service.rb
- [ ] ee/app/services/elastic/indexing_control_service.rb
- [ ] lib/gitlab/manifest_import/metadata.rb
- [ ] lib/gitlab/issues/rebalancing/state.rb
- [ ] lib/gitlab/etag_caching/store.rb
## Sessions
- [ ] app/models/active_session.rb
Before we can roll out the cross-slot pipeline functionality, we need to understand the full effects of switching to a pipeline. While pipelines reduce network round-trips between the Rails client and the Redis server, they introduce server-side overheads such as parsing each individual command and queueing replies in memory.
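As background on why the multi-key commands above are problematic: Redis Cluster maps every key to one of 16384 hash slots, and a single multi-key command (`MGET`, `DEL`, `UNLINK`, ...) is only accepted when all of its keys land in the same slot. Below is a minimal pure-Ruby sketch of the slot computation; the hashtag handling is simplified and the method names are ours, not GitLab's:

```ruby
# CRC16 (XMODEM variant), the checksum Redis Cluster uses for key hashing.
def crc16(str)
  crc = 0
  str.each_byte do |b|
    crc ^= b << 8
    8.times do
      crc = ((crc << 1) ^ ((crc & 0x8000).zero? ? 0 : 0x1021)) & 0xFFFF
    end
  end
  crc
end

# HASH_SLOT = CRC16(effective key) mod 16384. If the key contains a
# non-empty "{hashtag}", only the hashtag is hashed (simplified here:
# we take the first "{...}" group).
def hash_slot(key)
  if (m = key.match(/\{([^}]+)\}/))
    key = m[1]
  end
  crc16(key) % 16384
end

hash_slot("{user1}.followers") == hash_slot("{user1}.following") # => true
```

Keys can be forced into the same slot with a shared `{hashtag}`, which is why the two `{user1}` keys above may share one multi-key command, while unrelated keys generally map to different slots and therefore cannot.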
## Proposal
Use feature flags to compare client-side and server-side differences between the current multi-key commands and pipelines.

We should collect ~1 week of data for a week-on-week comparison. We may need to capture perf traces before and after (this will require SRE assistance). This is also a good opportunity to land our `Rails.cache` pipeline patch separately instead of cramming it into one large MR.
In summary, this should require two application-side MRs:

- MR 1: Add a feature-flag alternative for non-`Rails.cache` multi-key commands like `unlink`, `del`, and `mget`
- MR 2: Add a feature-flag-gated patch for `Rails.cache`'s `read_multi` and `delete_multi`

If the results from MR 1 are positive, we can proceed to MR 2 and invest effort into writing the patches.
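To make the first MR concrete, here is a hedged sketch of what the feature-flag alternative could look like. The flag name (`use_pipeline_over_multikey`), the `fetch_values` helper, and the `FakeRedis` stand-in (included so the example runs without a Redis server) are all hypothetical; in the application this would use `Feature.enabled?` and the real redis-rb client's `pipelined` block API.

```ruby
# In-memory stand-in for a Redis client (hypothetical; lets this sketch
# run without a real server).
class FakeRedis
  def initialize(data)
    @data = data
  end

  # Cross-slot multi-key command: rejected by Redis Cluster when the
  # keys hash to different slots.
  def mget(*keys)
    keys.map { |k| @data[k] }
  end

  # Mimics redis-rb's pipelined block API: queued commands are sent in
  # one round-trip and their replies are returned as an array.
  def pipelined
    queue = []
    yield PipelineCollector.new(@data, queue)
    queue
  end

  class PipelineCollector
    def initialize(data, queue)
      @data = data
      @queue = queue
    end

    def get(key)
      @queue << @data[key]
    end
  end
end

# Hypothetical flag check; the real code would call
# Feature.enabled?(:use_pipeline_over_multikey) or similar.
def use_pipeline_over_multikey?
  true
end

# Cluster-safe fetch: one GET per key inside a single pipeline (still
# one round-trip, but no cross-slot restriction) instead of one
# cross-slot MGET.
def fetch_values(redis, keys)
  if use_pipeline_over_multikey?
    redis.pipelined { |p| keys.each { |k| p.get(k) } }
  else
    redis.mget(*keys)
  end
end

fetch_values(FakeRedis.new("a" => "1", "b" => "2"), %w[a b]) # => ["1", "2"]
```

The same shape applies to `del` and `unlink`: issue one single-key command per key inside the pipeline rather than one multi-key command.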
## Carrying out the benchmark/experiment
We will control the rollout using a feature flag. Specific details can be kept in the feature-flag rollout issues (for example gitlab-org/gitlab#409436 (closed)) so the rollout is operationally traceable (ChatOps feature-flag command -> feature-flag config -> rollout issue).
In general we are looking out for:

- Client-side apdex
- Server-side `redis_primary_cpu` saturation ratio
- Server-side `redis_memory_cache`/`memory_redis_cache` saturation
Pipelining can be deemed a valid approach if the deviation in all three metrics stays within acceptable levels, where "acceptable" is defined as:

- Does not trigger any existing alerts
- Does not bring forward Tamland's forecasted soft-threshold violation date
## Results

### Benchmark 1
Summarised in the feature-flag issue: gitlab-org/gitlab#409436 (comment 1389685778). In brief, CPU and memory utilisation on `redis-cache` worsened ever so slightly, and client-side apdex showed sharper dips compared with the offset (1 week ago) plot.

Conclusion: this performance degradation is within expectations and is not a serious concern.
Update: the feature flag was disabled due to https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23699. We can re-enable it after the corrective action is performed.
### Benchmark 2
The benchmark started on 5 June 2023. As of 6 June 2023, the feature flag was activated for 50% of random actors. See gitlab-org/gitlab#410115 (comment 1421075381) for more details.

On 7 June, the feature flag was disabled because primary CPU % peaked at 85% (~5% higher than over the past 4 weeks) with the flag enabled at 50%. While that is still safe, there is little benefit in pushing this to 75-100% and risking saturation for the sake of benchmarking.
Given the increase at 50%, our intended Redis Cluster setup (see sizing details) should be able to handle the load, since we are using 5 shards with ~30% more total vCPU (80 vCPU vs the current 60 vCPU). That means `redis-cluster-cache` will have ~30% more vCPU than `redis-cache`.