Redis 6 upgrade
We are getting very close to saturation on two of our redis clusters:
- sidekiq: Capacity Planning: redis-sidekiq service, redis_primary_cpu is trending towards saturation around April 2021
- persistent: Capacity Planning: redis Service, redis_primary_cpu resource
One way to buy us some more time before having to shard or otherwise partition workloads is to upgrade to Redis 6. This new version supports threaded I/O.
From the CHANGELOG:
- Redis can now optionally use threads to handle I/O, allowing to serve 2 times as much operations per second in a single instance when pipelining cannot be used.
Our profiling shows a lot of time being spent in syscalls, so we can expect this to help quite a bit.
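For reference, threaded I/O is off by default in Redis 6 and is switched on in redis.conf. The fragment below is a minimal sketch of the relevant directives; the thread count of 4 is only an illustrative value, not the setting chosen for this rollout.

```conf
# redis.conf (Redis 6). Default is io-threads 1, i.e. threaded I/O disabled.
# The value 4 is illustrative; pick it based on the cores available on the host.
io-threads 4

# By default the I/O threads only handle writes. This directive also
# offloads socket reads and protocol parsing to them.
io-threads-do-reads yes
```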
This epic exists to track the effort of rolling out this upgrade to staging and production.
Results
redis-persistent
Threaded I/O enabled on 2021-03-11.
Annotated short-term timeline from the day after (source):
Annotated long-term tamland report:
redis-sidekiq
We applied a sidekiq configuration change on 2021-03-27 which helped a lot (source):
The upgrade to 6.0 required an upstream patch before the rollout could continue.
Further results TBD.
cc @AnthonySandoval @andrewn @smcgivern @cmiskell @gitlab-com/gl-infra/sre-observability
Activity
- Edited by Igor
Collecting some evidence to support the hypothesis that this upgrade will help.
redis-sidekiq
iwiedler@redis-sidekiq-03-db-gprd.c.gitlab-production.internal:~$ sudo perf record -ag -F 497 -- sleep 60
iwiedler@redis-sidekiq-03-db-gprd.c.gitlab-production.internal:~$ sudo perf script --header | stackcollapse-perf.pl --kernel | grep redis-server | flamegraph.pl --hash --colors=perl > flamegraph.redis-sidekiq.svg
On the redis-sidekiq cluster, redis is spending about 16% of its CPU time in read and write syscalls.
This suggests we may need to do something else here, as threaded I/O won't buy us as much.
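A percentage like the one above can also be computed directly from the folded stacks rather than read off the flamegraph. The helper below is a rough sketch, not the method used for the figure quoted here: `syscall_share` is a hypothetical name, and the frame pattern `sys_(read|write)` is an assumption, since syscall frame names vary by kernel version and the pattern may also match related syscalls.

```shell
# syscall_share PATTERN
# Reads folded stacks (stackcollapse-perf.pl output) on stdin and prints the
# percentage of samples whose stack contains a frame matching PATTERN.
syscall_share() {
  awk -v pat="$1" '
    {
      count = $NF                      # last field of a folded line is the sample count
      total += count
      if ($0 ~ pat) matched += count   # any frame in the stack matches the regex
    }
    END {
      if (total > 0) printf "%.1f\n", 100 * matched / total
      else print "0.0"
    }
  '
}

# Example usage (redis.folded is a hypothetical file name):
#   sudo perf script --header | stackcollapse-perf.pl --kernel | grep redis-server > redis.folded
#   syscall_share 'sys_(read|write)' < redis.folded
```

Counting by sample rather than by line matters because each folded line carries its own sample count.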
redis (persistent)
iwiedler@redis-03-db-gprd.c.gitlab-production.internal:~$ sudo perf record -ag -F 497 -- sleep 60
iwiedler@redis-03-db-gprd.c.gitlab-production.internal:~$ sudo perf script --header | stackcollapse-perf.pl --kernel | grep redis-server | flamegraph.pl --hash --colors=perl > flamegraph.redis-persistent.svg
On redis-persistent it's even more dramatic. Almost 49% of CPU is spent on read/write syscalls, most of which is writes.
We can expect to benefit from threaded I/O a lot here.
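To put the two measurements side by side, here is an idealized Amdahl-style estimate of the remaining main-thread load. It is a back-of-the-envelope sketch, not a prediction: it assumes all of the measured read/write syscall time can be spread evenly across the I/O threads (main thread included) with no synchronization overhead, and the thread count of 4 is a hypothetical value. `main_thread_load` is a made-up helper name.

```shell
# main_thread_load IO_PERCENT THREADS
# Prints the estimated main-thread CPU time, as a percentage of the current
# value, if IO_PERCENT of it is socket I/O split evenly across THREADS.
main_thread_load() {
  awk -v io="$1" -v n="$2" 'BEGIN { printf "%.2f\n", (100 - io) + io / n }'
}

main_thread_load 16 4   # redis-sidekiq: about 16% in read/write syscalls
main_thread_load 49 4   # redis-persistent: almost 49% in read/write syscalls
```

Under these assumptions the estimate is 88% of the current main-thread load for redis-sidekiq and about 63% for redis-persistent, which matches the expectation that redis-persistent benefits far more.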
@bjk-gitlab noted that we can consider bumping redis_exporter while we're at it. We should review whether there are any Redis 6 specific additions.
@igorwwwwwwwwwwwwwwwwwwww Is upgrading redis-cache on the roadmap as well? That might reduce the need for scalability#711. In scalability#751 we did see a jump in CPU in the tamland projections when we started tracking RackAttack requests in Redis.
@reprazent redis-cache utilization is less critical, usually peaking in the 60-80% range. The other two are much closer to saturation, that's why those were prioritized.
That said, as we've just enabled rack::attack more widely, we'll need to keep a close eye on it to see if it reaches similar critical levels.
> That said, as we've just enabled rack::attack more widely, we'll need to keep a close eye on it to see if it reaches similar critical levels.
The Rack::Attack tracking has been enabled since around November:
(old tamland)
But then the holidays came, and the projections haven't settled after both events yet:
It makes sense to keep an eye on this, but I don't expect redis-cache CPU usage to increase just because we've now enabled Rack::Attack for real.