Build Redis VMs for the redis-sessions instance
To move session storage from the shared/persistent Redis to a dedicated Redis, we need to build the new Redis store.
Basics
VMs are still the preferred option for persistent storage until we choose to make a concerted effort to migrate these workloads to Kubernetes (including the shared tooling, experience, documentation, etc.).
Per our existing Redis deployments, we will need a 3-VM "cluster" with Redis Sentinel for failover. As with most of our clusters, we will run Sentinel for this cluster on the Redis nodes themselves: it's pretty much a wash in terms of failure resilience, is consistent with existing deployments, and doesn't bring in unnecessary additional complexity.
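For illustration, a minimal sketch of the colocated Redis + Sentinel setup for one of the three nodes, expressed as Omnibus `gitlab.rb` settings rather than the Chef role attributes we actually manage (the IPs, password, and master name are placeholders):

```ruby
# One of the three redis-sessions VMs: Redis and Sentinel run side by side.
roles ['redis_sentinel_role', 'redis_master_role']

redis['bind'] = '10.0.0.1'               # placeholder: this node's internal IP
redis['port'] = 6379
redis['password'] = 'PLACEHOLDER'        # real value comes from secrets management
redis['master_name'] = 'redis-sessions'  # logical name Sentinel uses for the cluster
redis['master_password'] = 'PLACEHOLDER'
redis['master_ip'] = '10.0.0.1'          # placeholder: current master's internal IP

sentinel['bind'] = '10.0.0.1'
sentinel['quorum'] = 2                   # 2 of 3 sentinels must agree before failing over
```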
Sizing
Analysis conclusion from https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1308:
`redis-persistent` is using about 12 GiB of memory at present. With fragmentation, the RSS needed can go above that, so for headroom we want at least 20 GiB. We need to double that so that the page cache can hold an RDB dump for fast recovery, bringing us to 40 GiB. Since sessions are ~71% of the space in `redis-persistent`, we could go to 30 GiB as a baseline, but we want to have some burst capacity, so to be on the safe side let's double that to 60 GiB.
Recommendation: we should use a `c2-standard-16` instance, which has 16 vCPUs and 64 GiB of RAM.
Note: we should review after 3 months and see if we can lower the burst ceiling to a 30 GiB (`c2-standard-8`) node.
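To make the sizing arithmetic above explicit, a small sketch (the 12 GiB and ~71% figures come from the linked analysis; the doubling factors for page cache and burst are the assumptions stated above):

```ruby
current_usage_gib = 12                # redis-persistent memory used today
headroom_gib      = 20                # allow for RSS growth from fragmentation
with_page_cache   = headroom_gib * 2  # page cache must also hold an RDB dump => 40 GiB
sessions_share    = 0.71              # sessions' share of redis-persistent
baseline_gib      = 30                # ~71% of 40 GiB, rounded up
with_burst_gib    = baseline_gib * 2  # burst capacity => 60 GiB, fits a 64 GiB c2-standard-16
```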
Naming
redis-session or redis-sessions? redis-tracechunks suggests the latter (plural), but somehow redis-session still feels reasonable. DRI's choice when implementing.
Chosen: redis-sessions
Redis Configuration
Although we are splitting this storage out of the existing persistent Redis, there is a reasonable argument for treating it as a hybrid between the cache and persistent instances. With that in mind:
- Ensure it saves to disk (`gitlab_rb.redis.save` setting; see the shared/sidekiq/tracechunks Redis roles).
- Give it `maxmemory` settings like the cache instance (strongly consider `volatile-ttl` as the policy; see https://redis.io/topics/lru-cache for all options). A sketch of these settings follows this list.
- Add saturation metrics for memory that will page when usage reaches a high threshold that is still below full usage (75-80% seems reasonable).
  - Current saturation metrics look at Redis memory usage as a proportion of total system RAM, which has been valid so far, but for this instance we need to alert when we reach the desired threshold of the configured `maxmemory`, so we can decide whether it's anomalous (perhaps an incident that needs mitigation) or the result of natural growth (requires growing the node). We do not want this to quietly grow to `maxmemory` and start evicting (which is fine for the cache instance).
  - Alternatively:
    - Carefully set the % threshold of total system RAM to be effectively this `maxmemory` limit (possibly fragile), or
    - Alert if `redis_evicted_keys_total` goes above zero for this instance. This is not ideal though, as it catches the problem after the fact, not beforehand.
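As a rough sketch of the persistence and eviction settings described above, again in `gitlab.rb` form (the save intervals and the exact `maxmemory` value are illustrative assumptions, not decided values):

```ruby
# Persist to disk like the shared/sidekiq/tracechunks instances do.
# Each entry is "<seconds> <changes>": e.g. snapshot if >= 10000 keys changed in 60s.
redis['save'] = ['900 1', '300 10', '60 10000']

# Cap memory like the cache instance, below the c2-standard-16's 64 GiB of RAM,
# and prefer evicting the keys closest to their TTL rather than arbitrary keys.
redis['maxmemory'] = '60gb'
redis['maxmemory_policy'] = 'volatile-ttl'
redis['maxmemory_samples'] = 5
```

The saturation alerting at 75-80% of `maxmemory` would live in the runbooks metrics catalog rather than in this Redis-side configuration.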
Tasks
- Chef roles
- Terraform
- Runbooks (see gitlab-com/runbooks!3921 (merged) and gitlab-com/runbooks!4042 (merged) for a recent example of similar work):
  - Dashboard
  - Add to metrics catalog
  - Add to service catalog
  - Documentation
- Configure the instance in the application: &598 (comment 720912204); see the configuration sketch after this list
  - Kubernetes (for the vast majority of the application)
  - Chef (for the console nodes)
- Expire the silence on `type="redis-sessions"` once traffic levels are stable.
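For the application-configuration task, a minimal sketch of the Chef/console-node side, assuming the Omnibus `redis_sessions_*` settings follow the same pattern as the existing `redis_cache_*` / `redis_shared_state_*` ones (the hostnames, port, and master name here are placeholders; the Kubernetes side is configured through the Helm chart values instead):

```ruby
# Point the Rails session store at the dedicated redis-sessions cluster,
# discovered via Sentinel rather than a hard-coded master address.
gitlab_rails['redis_sessions_instance'] = 'redis://:PASSWORD@redis-sessions' # Sentinel master name, not a hostname
gitlab_rails['redis_sessions_sentinels'] = [
  { host: 'redis-sessions-01.example.internal', port: 26379 },
  { host: 'redis-sessions-02.example.internal', port: 26379 },
  { host: 'redis-sessions-03.example.internal', port: 26379 },
]
```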
History
See #1246 (closed) for recent near-identical activity.
Status
2021-11-12
New VMs are built in both environments.
The observability MR (runbooks) will need some refactoring to accommodate other architectural work happening in the metrics catalog, but that's not critical (we don't need observability until we are actually running things on the new VMs).
Configuring the application to use the new instance is not technically blocked on observability, but for safety we should have clear metrics available in case of unexpected behavior.