Multistore live migration logic improvements?
@msmiley, @igorwwwwwwwwwwwwwwwwwwww, and I walked through the multistore live migration logic yesterday as part of the effort to roll out redis-repository-cache. I wanted to document it and open this for discussion. Thankfully Matt already wrote some of this up, so I'm going to steal his write-up and adapt it to the work we did in production#8309 (closed).
How multistore works during a live migration
- Old datastore: secondary (redis-cache)
- New datastore: primary (redis-repository-cache)
Feature flags:
- use_primary_and_secondary_stores_for_repository_cache
- use_primary_store_as_default_for_repository_cache
Rollout phases:
- Starting state, no multistore, writing directly to old/secondary data store
- Start dual writes
- Shift reads to primary data store
- Stop dual writes, write only to new/primary data store
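To make the phase-to-flag mapping concrete, here is a minimal sketch that encodes the flag values per phase (as given in the tables in this issue) and works out which flag has to change for each transition. `PHASE_FLAGS` and `flags_to_toggle` are hypothetical names for illustration only; they are not part of GitLab.

```ruby
# Flag values per rollout phase, copied from the tables in this issue.
PHASE_FLAGS = {
  1 => { use_primary_and_secondary_stores_for_repository_cache: false,
         use_primary_store_as_default_for_repository_cache: false },
  2 => { use_primary_and_secondary_stores_for_repository_cache: true,
         use_primary_store_as_default_for_repository_cache: false },
  3 => { use_primary_and_secondary_stores_for_repository_cache: true,
         use_primary_store_as_default_for_repository_cache: true },
  4 => { use_primary_and_secondary_stores_for_repository_cache: false,
         use_primary_store_as_default_for_repository_cache: true }
}.freeze

# Hypothetical helper: returns the flags (with their new values) that differ
# between two phases.
def flags_to_toggle(from_phase, to_phase)
  before = PHASE_FLAGS.fetch(from_phase)
  after = PHASE_FLAGS.fetch(to_phase)
  after.reject { |flag, value| before[flag] == value }
end

(1..3).each do |phase|
  puts "phase #{phase} -> #{phase + 1}: #{flags_to_toggle(phase, phase + 1)}"
end
```

Each transition flips exactly one flag, so every step can be rolled out (and rolled back) on its own.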
UPDATE: The table below shows how we thought it worked until 2023-01-31.
Assumed state transitions for redis-repository-cache:
Phase of rollout | 1 | 2 | 3 | 4 |
---|---|---|---|---|
FF: use_primary_and_secondary_stores | f | t | t | f |
FF: use_primary_store_as_default | f | f | t | t |
New datastore (primary) | | RW | RW | RW |
Old datastore (secondary) | RW | RW | RW | |
Preferred ("default") datastore (P/S) | S | S | P | P |
UPDATE: In contrast to the above, the following table shows how it really works (as of today, 2023-01-31).
Actual state transitions for redis-repository-cache:
Phase of rollout | 1 | 2 | 3 | 4 |
---|---|---|---|---|
FF: use_primary_and_secondary_stores | f | t | t | f |
FF: use_primary_store_as_default | f | f | t | t |
New datastore (primary) | | RW | RW | RW |
Old datastore (secondary) | RW | RW | RW | |
Preferred ("default") datastore (P/S) | S | P | P | P |
One key note on the table above: the flag use_primary_store_as_default does nothing until use_primary_and_secondary_stores is turned off. You can see that on lines 114-122 of the code linked below.
Code in question:
https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/redis/multi_store.rb#L95-142
https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/redis/multi_store.rb#L171-177
https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/redis/multi_store.rb#L230-232
https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/redis/multi_store.rb#L243-321
(Probably more too, but those are what I saw at first glance.)
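As a rough illustration of the "Actual state transitions" table, the preferred-store selection can be modelled as below. This is a minimal sketch of the observed behaviour as of 2023-01-31, not the real multi_store.rb code; `preferred_store_before_fix` and `PHASES_BEFORE_FIX` are hypothetical names, and the sketch covers only the "Preferred" row (it does not model fallback reads).

```ruby
# Hypothetical helper: models only the "Preferred" row of the "Actual state
# transitions" table. It is not the real Gitlab::Redis::MultiStore code.
def preferred_store_before_fix(use_primary_and_secondary_stores:, use_primary_store_as_default:)
  # With dual writes on, the primary was preferred regardless of the
  # second flag.
  return :primary if use_primary_and_secondary_stores

  # The second flag only mattered once dual writes were off.
  use_primary_store_as_default ? :primary : :secondary
end

# Flag values for phases 1..4, in the order listed in the table.
PHASES_BEFORE_FIX = [[false, false], [true, false], [true, true], [false, true]].freeze

PHASES_BEFORE_FIX.each_with_index do |(dual, primary_default), index|
  store = preferred_store_before_fix(
    use_primary_and_secondary_stores: dual,
    use_primary_store_as_default: primary_default
  )
  puts "phase #{index + 1}: #{store}"
end
```

In this model the preferred store flips to primary as soon as dual writes start (phase 2), before the new store has had a chance to warm up.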
How should it work?
This is the topic for discussion, @gitlab-org/scalability. We had one incident during the rollout yesterday that was at least partially caused by slowness due to reading from an unwarmed cache.
The simplest fix seems to be changing https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/redis/multi_store.rb#L259-265 to read from secondary unless use_primary_store_as_default is set. That would be a good start toward giving us more fine-grained control over where we send reads, but I know that Matt and Igor had some other suggestions.
Summary of current state
See "Actual state transitions for redis-repository-cache" above for prior behaviour.
The current state, with both gitlab-org/gitlab!104210 (merged) and gitlab-org/gitlab!111893 (merged) in place, is:
Phase of rollout | 1 | 2 | 3 | 4 |
---|---|---|---|---|
FF: use_primary_and_secondary_stores | f | t | t | f |
FF: use_primary_store_as_default | f | f | t | t |
New datastore (primary) | | W | RW | RW |
Old datastore (secondary) | RW | RW | W | |
Preferred ("default") datastore (P/S) | S | S | P | P |
The 4 feature flag permutations now give us 4 distinct read-write behaviours. Note that fallback reads have been removed to prevent performance regressions during the initial portion of phase 2.
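The four behaviours can be sketched as a small model. It assumes, per the rollout phases listed at the top, that phase 1 touches only the old store and phase 4 only the new one, and that reads never fall back to the other store. `store_access` and `CURRENT_PHASES` are hypothetical names and are not part of Gitlab::Redis::MultiStore.

```ruby
# Hypothetical model of the current-state table; the method and key names
# are illustrative and are not part of Gitlab::Redis::MultiStore.
def store_access(use_primary_and_secondary_stores:, use_primary_store_as_default:)
  default = use_primary_store_as_default ? :primary : :secondary
  # Dual writes go to both stores; otherwise only the default store is written.
  writes = use_primary_and_secondary_stores ? %i[primary secondary] : [default]
  # No fallback reads: only the default store is ever read.
  { reads: [default], writes: writes }
end

# Flag values for phases 1..4, in the order listed in the table.
CURRENT_PHASES = [[false, false], [true, false], [true, true], [false, true]].freeze

CURRENT_PHASES.each_with_index do |(dual, primary_default), index|
  access = store_access(
    use_primary_and_secondary_stores: dual,
    use_primary_store_as_default: primary_default
  )
  puts "phase #{index + 1}: reads=#{access[:reads].inspect} writes=#{access[:writes].inspect}"
end
```

In this model reads only move to the primary in phase 3, after it has been receiving dual writes throughout phase 2.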