# Migrating etag cache store from SharedState to Cache
In gitlab-com/gl-infra/scalability#2441 (closed), group::scalability discussed the possibility of housing etag cache keys in `Gitlab::Redis::Cache`, as it seemed more appropriate. Etag cache keys are transient, expiring within 20 minutes of creation with no extension.
For GitLab SaaS, the etag cache middleware affects ~35-40% of web deployment traffic with a cache-hit rate of ~90%, so a hard cutover might not be safe. Having a large amount of requests miss the cache would lead to a surge in database queries.
## Proposed migration method
The proposed rollout would be controlled by a feature flag. The new logic in `get` would look up `Cache` and fall back to `SharedState`, while `.touch` would be a direct cutover.
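A minimal sketch of that read/write split, with plain hashes standing in for the two Redis stores and a stubbed flag check (in GitLab these would be `Gitlab::Redis::Cache`, `Gitlab::Redis::SharedState`, and `Feature.enabled?`; the names below are illustrative):

```ruby
# Stand-ins for the two Redis instances, so the sketch is self-contained.
CACHE = {}
SHARED_STATE = {}

# Hypothetical feature-flag check; the real flag name is an assumption.
def use_cache_store?
  true
end

# Read path: look up the new Cache store first, falling back to SharedState
# so keys written before the cutover are still found.
def get(key)
  return SHARED_STATE[key] unless use_cache_store?

  CACHE[key] || SHARED_STATE[key]
end

# Write path (.touch): direct cutover -- new values go only to Cache.
def touch(key, value)
  if use_cache_store?
    CACHE[key] = value
  else
    SHARED_STATE[key] = value
  end
end

SHARED_STATE["etag:project/1"] = "abc" # written before the migration
puts get("etag:project/1")             # => "abc" (served via the fallback)
touch("etag:project/1", "def")         # written to Cache only
puts get("etag:project/1")             # => "def" (now served from Cache)
```

The fallback read is what prevents a flood of misses at flag-enable time: keys touched before the flip still resolve from `SharedState` until they expire.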
`MultiStore` can be used safely even though the `SETNX` Redis commands used in `.touch` could lead to a data race. For example, when proc A and proc B both call `touch`:
```plaintext
# due to network routing, the ordering received on Redis could be interwoven
# if the calls are made simultaneously
proc A multistore calls SETNX on key XYZ in sharedstate -- succeeds
proc B multistore calls SETNX on key XYZ in sharedstate -- fails
proc B multistore calls SETNX on key XYZ in cache -- succeeds
proc A multistore calls SETNX on key XYZ in cache -- fails

the result is that proc A sets in sharedstate while proc B sets in cache
```
This would lead to a cache miss for proc B. However, without a migration, proc B would get a cache miss in the current state anyway. To be precise, with `n` processes racing, there will be `n-1` cache misses regardless.
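The `n-1` claim follows from `SETNX` semantics: only one of the racing writes can succeed per store, so the other `n-1` callers lose regardless of which store they hit. A small simulation of `SETNX` (set-if-not-exists) makes this concrete; the key and values are made up:

```ruby
# Simulated SETNX: sets the key only if absent, returning true on success,
# mirroring the Redis command's set-if-not-exists behaviour.
store = {}
setnx = lambda do |key, value|
  return false if store.key?(key)

  store[key] = value
  true
end

# n processes race to touch the same key; exactly one SETNX wins.
n = 5
winners = (1..n).count { |i| setnx.call("etag:issue/42", "value-from-proc-#{i}") }
losers = n - winners

puts winners # => 1
puts losers  # => 4, i.e. n - 1 losers (cache misses) regardless of store
```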
To simplify the migration, we will use the `MultiStore` to perform:
- dual-write for at least 20 minutes
- cut-over reads
- stop dual-write
- (in a separate MR), clean up the feature flag and release the feature
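The phases above can be sketched with a toy dual-write store. This is a simplified stand-in, not the real `Gitlab::Redis::MultiStore` (which is driven by feature flags and instrumentation); the method names and phase toggles here are assumptions for illustration:

```ruby
# Toy dual-write store: writes go to the new primary (Cache) and, while
# dual-write is on, also to the old fallback (SharedState). Reads start on
# the fallback and are cut over to the primary partway through.
class ToyMultiStore
  def initialize(primary, fallback)
    @primary = primary    # new store (Cache)
    @fallback = fallback  # old store (SharedState)
    @dual_write = true
    @read_primary = false
  end

  def write(key, value)
    @primary[key] = value
    @fallback[key] = value if @dual_write
  end

  def read(key)
    return @fallback[key] unless @read_primary

    @primary[key] || @fallback[key]
  end

  # Migration phase toggles (illustrative names).
  def cut_over_reads!
    @read_primary = true
  end

  def stop_dual_write!
    @dual_write = false
  end
end

cache = {}
shared = {}
store = ToyMultiStore.new(cache, shared)

store.write("k1", "v1")   # phase 1: dual-write lands in both stores
store.cut_over_reads!     # phase 2: reads now prefer Cache
store.stop_dual_write!    # phase 3: writes go to Cache only
store.write("k2", "v2")

puts store.read("k2")     # => "v2" (from Cache)
puts shared.key?("k2")    # => false (no longer dual-written)
```

Dual-writing for at least 20 minutes before cutting over reads matters because that is the maximum etag TTL: by then every live key has been written to `Cache` at least once.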
## Feature release
The feature can be released with the fallback `|| Gitlab::Redis::SharedState...` read to avoid a spike in cache misses. For self-managed (SM) users, assuming the rollout of the new version spans a period of time, there would be a window in which both old and new app versions are deployed:
- old apps would read from and write to `SharedState`
- new apps would read from `Cache` with fallback to `SharedState`, and write to `Cache`
There will be an increase in the cache-miss rate during the rollout window. Each cache miss will result in a `.touch` update. If the same key has differing values on both Redis instances during the migration window, whether a request is a cache miss depends on which version of the app it reaches: if the app's underlying Redis key value matches the request's etag, it is a hit; otherwise it is a miss. Over the deployment window, the percentage of cache misses will fall as more requests reach the newer version of the app.
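A small sketch of that version-dependent hit/miss behaviour, with hashes standing in for the two Redis instances and made-up key and etag values:

```ruby
# During the migration window the same key can hold different values in the
# two stores: new apps have touched it in Cache, while SharedState still has
# the older value written by old apps.
cache  = { "etag:key" => "v2" } # written by new apps
shared = { "etag:key" => "v1" } # stale value left by old apps

request_etag = "v1" # the client last saw the old value

old_app_value = shared["etag:key"]                      # old apps read SharedState
new_app_value = cache["etag:key"] || shared["etag:key"] # new apps read Cache first

puts old_app_value == request_etag # => true  (hit if routed to an old app)
puts new_app_value == request_etag # => false (miss if routed to a new app)
```

As old pods drain, more requests take the `new_app_value` path and re-touch keys in `Cache`, so the divergence, and with it the elevated miss rate, disappears.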
Once the deployment is finished, cache-miss rates will stabilize, since all pods/rails-apps will only write to `Cache`. The 50k reference architecture states that there are 12 Rails nodes.
- for 10k users and above: during the upgrade there will be an uptick in cache misses, as the reference architecture recommends using 2 separate Redis instances for cache and persistent data
- under the 10k-user reference architecture: no behaviour change is expected because reads and writes go to the same Redis instance