# Migrating etag cache store from SharedState to Cache
In gitlab-com/gl-infra/scalability#2441 (closed), group::scalability discussed the possibility of housing etag cache keys in `Gitlab::Redis::Cache`, as it seemed more appropriate. Etag cache keys are transient, expiring within 20 minutes of creation with no extension.
For GitLab SaaS, the etag cache middleware affects ~35-40% of web deployment traffic with a cache-hit rate of ~90%, so a hard cutover might not be safe. Having a large amount of requests miss the cache would lead to a surge in database queries.
## Proposed migration method
The proposed rollout would be controlled by a feature flag. The new logic in `get` would look up `Cache` and fall back to `SharedState`, while `.touch` would be a direct cutover.
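A minimal sketch of that read/write split, with plain hashes standing in for the two Redis stores and a stubbed flag check (in GitLab these would be `Gitlab::Redis::Cache`, `Gitlab::Redis::SharedState`, and `Feature.enabled?`; the names below are illustrative):

```ruby
# Stand-ins for the two Redis instances, so the sketch is self-contained.
CACHE = {}
SHARED_STATE = {}

# Hypothetical feature-flag check; the real flag name is an assumption.
def use_cache_store?
  true
end

# Read path: look up the new Cache store first, falling back to SharedState
# so keys written before the cutover are still found.
def get(key)
  return SHARED_STATE[key] unless use_cache_store?

  CACHE[key] || SHARED_STATE[key]
end

# Write path (.touch): direct cutover -- new values go only to Cache.
def touch(key, value)
  if use_cache_store?
    CACHE[key] = value
  else
    SHARED_STATE[key] = value
  end
end

SHARED_STATE["etag:project/1"] = "abc" # written before the migration
puts get("etag:project/1")             # => "abc" (served via the fallback)
touch("etag:project/1", "def")         # written to Cache only
puts get("etag:project/1")             # => "def" (now served from Cache)
```

The fallback read is what prevents a flood of misses at flag-enable time: keys touched before the flip still resolve from `SharedState` until they expire.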
`MultiStore` can be used safely even though the `SETNX` Redis commands used in `.touch` could lead to a data race. For example, when proc A and proc B both call `touch`:
```plaintext
# due to network routing, the ordering received on Redis could be interwoven
# if the calls are made simultaneously
proc A multistore calls SETNX on key XYZ in sharedstate -- succeeds
proc B multistore calls SETNX on key XYZ in sharedstate -- fails
proc B multistore calls SETNX on key XYZ in cache -- succeeds
proc A multistore calls SETNX on key XYZ in cache -- fails

the result is that proc A sets in sharedstate while proc B sets in cache
```
This would lead to a cache miss for proc B. However, without a migration, proc B would get a cache miss in the current state anyway. To be precise, with `n` processes racing, there will be `n-1` cache misses regardless.
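The `n-1` claim follows from `SETNX` semantics: only one of the racing writes can succeed per store, so the other `n-1` callers lose regardless of which store they hit. A small simulation of `SETNX` (set-if-not-exists) makes this concrete; the key and values are made up:

```ruby
# Simulated SETNX: sets the key only if absent, returning true on success,
# mirroring the Redis command's set-if-not-exists behaviour.
store = {}
setnx = lambda do |key, value|
  return false if store.key?(key)

  store[key] = value
  true
end

# n processes race to touch the same key; exactly one SETNX wins.
n = 5
winners = (1..n).count { |i| setnx.call("etag:issue/42", "value-from-proc-#{i}") }
losers = n - winners

puts winners # => 1
puts losers  # => 4, i.e. n - 1 losers (cache misses) regardless of store
```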
To simplify the migration, we will use the `MultiStore` to perform:
- dual-write for at least 20 minutes
- cut-over reads
- stop dual-write
- (in a separate MR), clean up the feature flag and release the feature
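The phases above can be sketched with a toy dual-write store. This is a simplified stand-in, not the real `Gitlab::Redis::MultiStore` (which is driven by feature flags and instrumentation); the method names and phase toggles here are assumptions for illustration:

```ruby
# Toy dual-write store: writes go to the new primary (Cache) and, while
# dual-write is on, also to the old fallback (SharedState). Reads start on
# the fallback and are cut over to the primary partway through.
class ToyMultiStore
  def initialize(primary, fallback)
    @primary = primary    # new store (Cache)
    @fallback = fallback  # old store (SharedState)
    @dual_write = true
    @read_primary = false
  end

  def write(key, value)
    @primary[key] = value
    @fallback[key] = value if @dual_write
  end

  def read(key)
    return @fallback[key] unless @read_primary

    @primary[key] || @fallback[key]
  end

  # Migration phase toggles (illustrative names).
  def cut_over_reads!
    @read_primary = true
  end

  def stop_dual_write!
    @dual_write = false
  end
end

cache = {}
shared = {}
store = ToyMultiStore.new(cache, shared)

store.write("k1", "v1")   # phase 1: dual-write lands in both stores
store.cut_over_reads!     # phase 2: reads now prefer Cache
store.stop_dual_write!    # phase 3: writes go to Cache only
store.write("k2", "v2")

puts store.read("k2")     # => "v2" (from Cache)
puts shared.key?("k2")    # => false (no longer dual-written)
```

Dual-writing for at least 20 minutes before cutting over reads matters because that is the maximum etag TTL: by then every live key has been written to `Cache` at least once.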
## Feature release
The feature can be released with the fallback `|| Gitlab::Redis::SharedState...` read to avoid a spike in cache misses. For self-managed (SM) users, assuming the rollout of the new version spans a period of time, there would be a window in which both old and new app versions are deployed:
- old apps would read from and write to `SharedState`
- new apps would read from `Cache` with fallback to `SharedState`, and write to `Cache`
There will be an increase in the cache-miss rate during the rollout window. Each cache miss will result in a `.touch` update. If the same key has differing values on both Redis instances during the migration window, whether a request is a cache miss depends on which version of the app it reaches: if the app's underlying Redis key value matches the request's etag, it is a hit; otherwise it is a miss. Over the deployment window, the percentage of cache misses will fall as more requests reach the newer version of the app.
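A small sketch of that version-dependent hit/miss behaviour, with hashes standing in for the two Redis instances and made-up key and etag values:

```ruby
# During the migration window the same key can hold different values in the
# two stores: new apps have touched it in Cache, while SharedState still has
# the older value written by old apps.
cache  = { "etag:key" => "v2" } # written by new apps
shared = { "etag:key" => "v1" } # stale value left by old apps

request_etag = "v1" # the client last saw the old value

old_app_value = shared["etag:key"]                      # old apps read SharedState
new_app_value = cache["etag:key"] || shared["etag:key"] # new apps read Cache first

puts old_app_value == request_etag # => true  (hit if routed to an old app)
puts new_app_value == request_etag # => false (miss if routed to a new app)
```

As old pods drain, more requests take the `new_app_value` path and re-touch keys in `Cache`, so the divergence, and with it the elevated miss rate, disappears.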
Once the deployment is finished, cache-miss rates will stabilize, since all pods/rails-apps will only write to `Cache`. The 50k reference architecture states that there are 12 Rails nodes.
- for 10k users and above: during the upgrade there will be an uptick in cache misses, as the reference architecture recommends using 2 separate Redis instances for cache and persistent data
- under the 10k-user reference architecture: no behaviour change is expected because reads and writes go to the same Redis instance