Move Trace Chunk storage to its own dedicated Redis instance

Background

A trace chunk leak led to excessive Redis storage in #327263 (closed). To avoid memory saturation, the instance was expanded in gitlab-com/gl-infra/production#4194 (closed).

This incident drew attention to our current approach of storing trace chunks in the persistent (shared state) Redis instance.

Trace chunk storage is a distinct workload from most of the other traffic we send to the persistent Redis instance. The availability of GitLab.com also relies heavily on the persistent Redis instance (authentication, sessions, load balancing, and more), so it makes sense to isolate trace chunk traffic on its own independent instance.

Proposal: Move Trace Chunk storage to its own dedicated Redis instance

For self-managed installations, this instance would default to the existing shared state Redis instance.
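
The fallback could look roughly like the sketch below. It is illustrative only: `redis_params_for`, the `:trace_chunks` and `:shared_state` keys, and the shape of the config hash are assumptions made for this example, not the actual GitLab configuration API.

```ruby
# Hypothetical helper: return the connection settings for a named Redis
# instance, falling back to the shared state settings when no dedicated
# configuration has been provided (the self-managed default).
def redis_params_for(instance, configs)
  configs.fetch(instance) { configs.fetch(:shared_state) }
end

# Self-managed: only shared state is configured, so trace chunks reuse it.
self_managed = { shared_state: { url: "redis://localhost:6379" } }
redis_params_for(:trace_chunks, self_managed)

# GitLab.com-style setup: a dedicated instance is configured and takes priority.
dedicated = self_managed.merge(trace_chunks: { url: "redis://chunks.example:6379" })
redis_params_for(:trace_chunks, dedicated)
```

With this shape, existing installations would need no configuration change unless they opt in to a dedicated instance.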

Technical Details

The Redis implementation is encapsulated in app/models/ci/build_trace_chunks/redis.rb, so the code change would be relatively contained.
Online migration might be a bit tricky, but we could probably use a four-stage migration (write-both/read-old, then write-both/read-new, then write-new/read-new, then cleanup of the old instance), using feature flags to control the stages.
As an additional safety mechanism, we could potentially set maxmemory on the redis-chunks instance, with a volatile-ttl eviction policy (maxmemory-policy). This would mean that, instead of OOMing if we ever experienced a similar leak in the future, Redis would evict the keys with the shortest remaining TTL.
1. This is not ideal, but probably better than OOMing in this case.
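
The staged migration described above can be sketched as a small wrapper that routes reads and writes according to the current stage. This is a minimal sketch under stated assumptions: `MigratingChunkStore`, the `STAGES` table, and the stage names are hypothetical, plain Ruby hashes stand in for the two Redis instances, and in practice the stage would be derived from feature flags rather than passed in directly.

```ruby
# Hypothetical stage table mirroring the proposal. The final "cleanup old"
# stage is a one-off task rather than a routing mode, so it is not listed.
STAGES = {
  write_both_read_old: { write: %i[old new], read: :old },
  write_both_read_new: { write: %i[old new], read: :new },
  write_new_read_new:  { write: %i[new],     read: :new }
}.freeze

class MigratingChunkStore
  # old_store and new_store only need to respond to [] and []=.
  def initialize(old_store, new_store, stage)
    @stores = { old: old_store, new: new_store }
    @plan = STAGES.fetch(stage) # raises KeyError on an unknown stage
  end

  # Write to every store the current stage requires.
  def set(key, data)
    @plan[:write].each { |name| @stores[name][key] = data }
    data
  end

  # Read from the single store the current stage designates.
  def get(key)
    @stores[@plan[:read]][key]
  end
end

old_redis = {}
new_redis = {}
store = MigratingChunkStore.new(old_redis, new_redis, :write_both_read_old)
store.set("chunk:1", "log output")
store.get("chunk:1")
```

One consequence of this design is that any chunk written before dual writes began exists only in the old store, so reads should switch to the new instance only once such chunks are gone, or the read path would need a fallback to the old instance.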

@grzesiek @mwasilewski-gitlab @igorwwwwwwwwwwwwwwwwwwww @marin @rnienaber @cheryl.li

Edited Apr 12, 2021 by Andrew Newdigate