Identify the most significant improvements to Redis resilience
Background
Redis is at the core of our infrastructure, and the scalability of GitLab.com is reliant on Redis being up to the task. We should focus on some incremental improvements to our Redis stack which clear the way for continued future improvements.
Previously, the following items were noted:
- Redis Session State sharded away from the main Redis Persistent instance (50% SRE) #186 (closed)
- Redis-Sidekiq Sharding - related to potential Redis-Sidekiq saturation: #191 (closed)
  - Andrew: probably deprioritise this
- Build the Redis Cross-slot Validator gitlab-org/gitlab#206903 (moved)
- Redis Observability (also Observability Theme)
  - Apdex scores for Redis Persistent
  - Logging of Redis call activity
- Smembers Improvements https://gitlab.com/gitlab-com/gl-infra/scalability/issues/124
- Redis AOF Persistence format
  - Investigate append-only file (AOF) persistence for all Redis instances
  - Andrew: Igor has expressed interest in doing this: should we push this onto the resilience teams?
Another thing to consider is that there might be other teams in the Infrastructure department that are also investigating the performance of Redis. We need to reach out to them and synchronize our efforts.
Existing Issues tagged for Service::Redis
Link | Title |
---|---|
#1 (closed) | Huge incoming emails are being loaded into Redis |
#2 (closed) | JobWaiter garbage collection / persistence problems |
#48 (closed) | Investigation Redis Cache CPU Saturation |
#49 (closed) | Use multiple Redis cache instances in Rails.cache |
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/124 | Very poor Redis set cache performance on repositories with huge branch and tag sets (item 5 in list above) |
#133 | Redis failover should be tested as part of our QA integration test suite |
#157 (closed) | Observability for Persistent Redis calls |
#186 (closed) | Move Redis Session State Store to a separate Redis instance (item 1 in list above) |
#187 (closed) | Send Redis Slowlog Events to Structured Logging |
gitlab-org/gitlab#206903 (moved) | Redis CROSSLOT validator (item 3 in list above) |
Exit Criteria
- Infrastructure teams have been contacted to determine whether there are other Redis investigation efforts, and these are combined with ours where possible.
- Projects (tracked by Epics) are created for separate aspects of Redis resilience work, and they have clear goals.
- All existing issues listed above are triaged: each has been added to an epic, closed, or marked with the reason it will not be addressed as part of this effort.
- Projects have a first-pass prioritization applied.
Edited by Rachel Nienaber