Build Redis VMs for the redis-sessions instance
To move session storage from the shared/persistent Redis to a dedicated Redis, we need to build the new Redis store.
Basics
VMs are still the preferred option for persistent storage until we choose to make a concerted effort to migrate these workloads to Kubernetes (including the shared tooling, experience, documentation, etc.).
Per our existing Redis deployments, we will need a 3-VM "cluster" with Redis Sentinel for failover. As with most of our clusters, we will run Sentinel for this cluster on the Redis nodes themselves: it's pretty much a wash in terms of failure resilience, is consistent with existing deployments, and doesn't bring in unnecessary additional complexity.
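For illustration, a minimal sketch of the colocated Redis + Sentinel setup for one of the three nodes, expressed as Omnibus `gitlab.rb` settings rather than the Chef role attributes we actually manage (the IPs, password, and master name are placeholders):

```ruby
# One of the three redis-sessions VMs: Redis and Sentinel run side by side.
roles ['redis_sentinel_role', 'redis_master_role']

redis['bind'] = '10.0.0.1'               # placeholder: this node's internal IP
redis['port'] = 6379
redis['password'] = 'PLACEHOLDER'        # real value comes from secrets management
redis['master_name'] = 'redis-sessions'  # logical name Sentinel uses for the cluster
redis['master_password'] = 'PLACEHOLDER'
redis['master_ip'] = '10.0.0.1'          # placeholder: current master's internal IP

sentinel['bind'] = '10.0.0.1'
sentinel['quorum'] = 2                   # 2 of 3 sentinels must agree before failing over
```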
Sizing
Analysis conclusion from https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1308:
`redis-persistent` is using about 12 GiB of memory at present. With fragmentation, the RSS needed can go above that, so for headroom we want at least 20 GiB. We need to double that so that the page cache can hold an RDB dump for fast recovery, bringing us to 40 GiB. Since sessions are ~71% of the space in `redis-persistent`, we could go to 30 GiB as a baseline, but we want to have some burst capacity, so to be on the safe side let's double that to 60 GiB.
Recommendation: we should use a `c2-standard-16` instance, which has 16 vCPUs and 64 GiB of RAM.
Note: we should review after 3 months and see if we can lower the burst ceiling to a 30 GiB (`c2-standard-8`) node.
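To make the sizing arithmetic above explicit, a small sketch (the 12 GiB and ~71% figures come from the linked analysis; the doubling factors for page cache and burst are the assumptions stated above):

```ruby
current_usage_gib = 12                # redis-persistent memory used today
headroom_gib      = 20                # allow for RSS growth from fragmentation
with_page_cache   = headroom_gib * 2  # page cache must also hold an RDB dump => 40 GiB
sessions_share    = 0.71              # sessions' share of redis-persistent
baseline_gib      = 30                # ~71% of 40 GiB, rounded up
with_burst_gib    = baseline_gib * 2  # burst capacity => 60 GiB, fits a 64 GiB c2-standard-16
```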
Naming
redis-session or redis-sessions? redis-tracechunks suggests the latter (plural), but somehow redis-session still feels reasonable. DRI's choice when implementing.
Chosen: redis-sessions
Redis Configuration
Although we are splitting this storage out of the existing persistent Redis, there is a reasonable argument for treating it as a hybrid between the cache and persistent instances. With that in mind:
- Ensure it saves to disk (`gitlab_rb.redis.save` setting; see the shared/sidekiq/tracechunks Redis roles).
- Give it `maxmemory` settings like the cache instance (strongly consider `volatile-ttl` as the policy; see https://redis.io/topics/lru-cache for all options). A sketch of these settings follows this list.
- Add saturation metrics for memory that will page when usage reaches a high threshold that is still below full usage (75-80% seems reasonable).
  - Current saturation metrics look at Redis memory usage as a proportion of total system RAM, which has been valid so far, but for this instance we need to alert when we reach the desired threshold of the configured `maxmemory`, so we can decide whether it's anomalous (perhaps an incident that needs mitigation) or the result of natural growth (requires growing the node). We do not want this to quietly grow to `maxmemory` and start evicting (which is fine for the cache instance).
  - Alternatively:
    - Carefully set the % threshold of total system RAM to be effectively this `maxmemory` limit (possibly fragile), or
    - Alert if `redis_evicted_keys_total` goes above zero for this instance. This is not ideal though, as it catches the problem after the fact, not beforehand.
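As a rough sketch of the persistence and eviction settings described above, again in `gitlab.rb` form (the save intervals and the exact `maxmemory` value are illustrative assumptions, not decided values):

```ruby
# Persist to disk like the shared/sidekiq/tracechunks instances do.
# Each entry is "<seconds> <changes>": e.g. snapshot if >= 10000 keys changed in 60s.
redis['save'] = ['900 1', '300 10', '60 10000']

# Cap memory like the cache instance, below the c2-standard-16's 64 GiB of RAM,
# and prefer evicting the keys closest to their TTL rather than arbitrary keys.
redis['maxmemory'] = '60gb'
redis['maxmemory_policy'] = 'volatile-ttl'
redis['maxmemory_samples'] = 5
```

The saturation alerting at 75-80% of `maxmemory` would live in the runbooks metrics catalog rather than in this Redis-side configuration.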
Tasks
- Chef roles
- Terraform
- Runbooks (see gitlab-com/runbooks!3921 (merged) and gitlab-com/runbooks!4042 (merged) for a recent example of similar work):
  - Dashboard
  - Add to metrics catalog
  - Add to service catalog
  - Documentation
- Configure the instance in the application: &598 (comment 720912204); see the configuration sketch after this list
  - Kubernetes (for the vast majority of the application)
  - Chef (for the console nodes)
- Expire the silence on `type="redis-sessions"` once traffic levels are stable.
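For the application-configuration task, a minimal sketch of the Chef/console-node side, assuming the Omnibus `redis_sessions_*` settings follow the same pattern as the existing `redis_cache_*` / `redis_shared_state_*` ones (the hostnames, port, and master name here are placeholders; the Kubernetes side is configured through the Helm chart values instead):

```ruby
# Point the Rails session store at the dedicated redis-sessions cluster,
# discovered via Sentinel rather than a hard-coded master address.
gitlab_rails['redis_sessions_instance'] = 'redis://:PASSWORD@redis-sessions' # Sentinel master name, not a hostname
gitlab_rails['redis_sessions_sentinels'] = [
  { host: 'redis-sessions-01.example.internal', port: 26379 },
  { host: 'redis-sessions-02.example.internal', port: 26379 },
  { host: 'redis-sessions-03.example.internal', port: 26379 },
]
```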
History
See #1246 (closed) for recent near-identical activity.
Status
2021-11-12
New VMs are built in both environments.
The observability MR (runbooks) will need some refactoring to accommodate other architectural work happening in the metrics catalog, but that's not critical (we don't need observability until we are actually running things on the new VMs).
Configuring the application to use the new instance is not technically blocked on observability, but for safety we should have clear metrics available in case of unexpected behavior.