Redis 6 upgrade

Closed Epic created by Igor

We are getting very close to saturation on two of our Redis clusters.

    One way to buy us some more time before having to shard or otherwise partition workloads is to upgrade to Redis 6. This new version supports threaded I/O.

    From the CHANGELOG:

    • Redis can now optionally use threads to handle I/O, allowing to serve 2 times as much operations per second in a single instance when pipelining cannot be used.

    Our profiling shows a lot of time being spent in syscalls, so we can expect this to help quite a bit.
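For reference, threaded I/O is off by default and is enabled via redis.conf at startup (it cannot be toggled at runtime). The values below are an illustrative sketch, not our production settings:

```
# redis.conf (illustrative; io-threads must be set before the server starts)
io-threads 4              # number of I/O threads; 1 (the default) disables threading
io-threads-do-reads yes   # also thread reads; by default only writes are offloaded
```

Upstream suggests setting io-threads to fewer than the number of available cores, leaving headroom for the main thread.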

    This epic exists to track the effort of rolling out this upgrade to staging and production.

    Results

    redis-persistent

    Threaded I/O enabled on 2021-03-11.

    Annotated short-term timeline from the day after (source):

[image: annotated short-term timeline]

    Annotated long-term tamland report:

    tamland_threaded_io

    redis-sidekiq

    We applied a sidekiq configuration change on 2021-03-27 which helped a lot (source):

[image: annotated graph]

[image: annotated graph]

    Upgrade to 6.0 required an upstream patch to continue rollout.

    Further results TBD.

    cc @AnthonySandoval @andrewn @smcgivern @cmiskell @gitlab-com/gl-infra/sre-observability

    Edited by Igor


    Activity

    • Igor added 1 deleted label
    • Igor added epic &353 (closed) as parent epic
    • Igor changed the description ·
    • Igor mentioned in merge request gitlab-org/omnibus-gitlab!4930
    • Igor
      Author

      Collecting some evidence to support the hypothesis that this upgrade will help.

      redis-sidekiq

On redis-sidekiq-03-db-gprd.c.gitlab-production.internal:

```shell
sudo perf record -ag -F 497 -- sleep 60
sudo perf script --header | stackcollapse-perf.pl --kernel | grep redis-server | flamegraph.pl --hash --colors=perl > flamegraph.redis-sidekiq.svg
```

On the redis-sidekiq cluster, Redis spends about 16% of its CPU time in read and write syscalls.

This suggests we may need to do something else here, as threaded I/O won't buy us as much.

      flamegraph.redis-sidekiq.svg

      redis (persistent)

On redis-03-db-gprd.c.gitlab-production.internal:

```shell
sudo perf record -ag -F 497 -- sleep 60
sudo perf script --header | stackcollapse-perf.pl --kernel | grep redis-server | flamegraph.pl --hash --colors=perl > flamegraph.redis-persistent.svg
```

On redis-persistent it's even more dramatic: almost 49% of CPU time is spent in read/write syscalls, most of it in writes.

      We can expect to benefit from threaded I/O a lot here.

      flamegraph.redis-persistent.svg
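As a hedged back-of-envelope (an upper bound under assumptions, not a measurement): if the syscall fractions above were fully offloadable to I/O threads, Amdahl's law caps the per-instance speedup. Assuming 4 I/O threads:

```shell
# Amdahl's law upper bound: speedup = 1 / ((1 - p) + p/n)
# p: fraction of CPU in read/write syscalls (0.16 redis-sidekiq, 0.49 redis-persistent)
# n: assumed number of I/O threads (4 here)
for p in 0.16 0.49; do
  awk -v p="$p" 'BEGIN { printf "p=%.2f -> %.2fx\n", p, 1/((1 - p) + p/4) }'
done
# p=0.16 -> 1.14x
# p=0.49 -> 1.58x
```

Roughly 1.14x at best for redis-sidekiq versus 1.58x for redis-persistent, which is consistent with the conclusion that redis-sidekiq will need something beyond threaded I/O.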

      Edited by Igor
    • Igor
      Author

      @bjk-gitlab noted that we can consider bumping redis_exporter while we're at it. We should review whether there are any Redis 6 specific additions.

    • Igor mentioned in issue scalability#655
    • Igor mentioned in issue scalability#590
      • Bob Van Landuyt

@igorwwwwwwwwwwwwwwwwwwww is upgrading redis-cache on the roadmap as well? That might reduce the need for scalability#711. In scalability#751 we did see a jump in CPU in the tamland projections when we started tracking RackAttack requests in Redis.

      • Igor
        Author

@reprazent redis-cache utilization is less critical, usually peaking in the 60-80% range. The other two are much closer to saturation, which is why they were prioritized.

That said, as we've just enabled Rack::Attack more widely, we'll need to keep a close eye on it to see if it reaches similar critical levels.

      • Bob Van Landuyt

That said, as we've just enabled Rack::Attack more widely, we'll need to keep a close eye on it to see if it reaches similar critical levels.

The Rack::Attack tracking has been enabled since ~November:

        tamland

        (old tamland)

        But then holidays came and the projections haven't settled from both events yet:

[image: tamland projections]

It makes sense to keep an eye on this, but I don't expect any increase in redis-cache CPU usage just because we've now enabled Rack::Attack for real.

    • Igor mentioned in issue scalability#804
    • Igor mentioned in merge request gitlab-org/charts/gitlab!1772