DB multi-request stickiness doesn't stick as expected
Extracted from gitlab-org/gitlab!49294 (comment 543682086)
Our DB Load Balancing layer has a special sticking mechanism that spans requests. It keeps reads consistent despite replication lag on the replicas. The flow looks like this:
- The namespace of the stickiness can be configured. By default, it is the user ID.
- After a request ends, a Rack middleware writes the current write location into Redis if the request performed any write.
- In the next request, inside the middleware, that last write location is compared with the LSN of every replica to determine whether all replicas have caught up. If any of them has not, all queries inside that session stick to the primary.
- In APIs, the namespace of the stickiness can be set when the main object inside the controller is found:
    def current_job
      id = params[:id]
      if id
        ::Gitlab::Database::LoadBalancing::RackMiddleware
          .stick_or_unstick(env, :build, id)
      end
      super
    end

    override :current_runner
    def current_runner
      token = params[:token]
      if token
        ::Gitlab::Database::LoadBalancing::RackMiddleware
          .stick_or_unstick(env, :runner, token)
      end
      super
    end

    def current_user
      strong_memoize(:current_user) do
        user = super
        if user
          ::Gitlab::Database::LoadBalancing::RackMiddleware
            .stick_or_unstick(env, :user, user.id)
        end
        user
      end
    end
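As a rough model of the flow listed above (write location stored at the end of a request, then compared with the replicas at the start of the next one), here is a minimal sketch. It is not GitLab's implementation: the class and method names are hypothetical, a Hash stands in for Redis, and plain integers stand in for PostgreSQL LSNs.

```ruby
# Simplified model of cross-request sticking. Every name here is hypothetical;
# a Hash stands in for Redis and integers stand in for PostgreSQL LSNs.
class StickingSketch
  def initialize(replica_locations)
    @store = {} # stand-in for Redis
    @replica_locations = replica_locations
  end

  # End of request: remember the primary's write location, but only if the
  # request performed a write.
  def finish_request(namespace, id, wrote:, primary_location:)
    @store[[namespace, id]] = primary_location if wrote
  end

  # Start of the next request: stick to the primary unless every replica has
  # replayed at least up to the stored write location.
  def use_primary?(namespace, id)
    last_write = @store[[namespace, id]]
    return false if last_write.nil?

    caught_up = @replica_locations.all? { |location| location >= last_write }
    @store.delete([namespace, id]) if caught_up # unstick (assumed behavior)
    !caught_up
  end
end

sticking = StickingSketch.new([90, 120])
sticking.finish_request(:user, 42, wrote: true, primary_location: 100)
puts sticking.use_primary?(:user, 42) # prints "true": one replica is still at 90
```

Note that the lookup is keyed by the (namespace, id) pair, which is why it matters which pairs get a write location recorded at the end of a request.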
If an endpoint decides to scope to a namespace, it stores the namespace and ID in the request's env hash:
    def self.stick_or_unstick(env, namespace, id)
      return unless LoadBalancing.enable?

      Sticking.unstick_or_continue_sticking(namespace, id)
      env[STICK_OBJECT] = [namespace, id]
    end
However, because the hash key is fixed, env[STICK_OBJECT] holds only the last namespace it receives. Hence, after a request ends, the last write location is written for only one namespace, and follow-up requests scoped to the other namespaces do not stick as expected.
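The overwrite can be reproduced in isolation. The snippet below is a stripped-down stand-in for the method above, not the real middleware: the string value of STICK_OBJECT and the order of the three calls are assumptions made for illustration.

```ruby
# Stand-in for the real method, reduced to the assignment that matters.
STICK_OBJECT = 'load_balancing.stick_object' # assumed key name

def stick_or_unstick(env, namespace, id)
  env[STICK_OBJECT] = [namespace, id] # replaces any pair stored earlier
end

env = {}
stick_or_unstick(env, :user, 42)
stick_or_unstick(env, :runner, 'token-abc')
stick_or_unstick(env, :build, 7)

p env[STICK_OBJECT] # prints [:build, 7]; the :user and :runner pairs are gone
```

When the request ends, the middleware only sees the :build pair, so no write location is recorded for the user or the runner.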
Solution
Expand env[STICK_OBJECT] to an array, and update the callers accordingly. For example, if a request touches 3 namespaces, the corresponding namespace is checked when each object is initialized. If any of them has a lagging write location, the request sticks to the primary.
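One possible shape for this change, as a sketch only: the env entry accumulates pairs instead of being reassigned, and the end-of-request step iterates over all of them. The helper record_write_locations, the Hash used in place of Redis, and the integer write location are hypothetical, and the per-namespace check done by Sticking.unstick_or_continue_sticking is left out.

```ruby
STICK_OBJECT = 'load_balancing.stick_object' # assumed key name

# Append instead of overwrite, so every namespace seen in the request survives.
def stick_or_unstick(env, namespace, id)
  (env[STICK_OBJECT] ||= []) << [namespace, id]
end

# End of request: record the write location for each collected pair.
# `store` is a Hash standing in for Redis.
def record_write_locations(env, store, write_location)
  env.fetch(STICK_OBJECT, []).uniq.each do |namespace, id|
    store[[namespace, id]] = write_location
  end
end

env = {}
store = {}
stick_or_unstick(env, :user, 42)
stick_or_unstick(env, :build, 7)
record_write_locations(env, store, 100)

p store.keys # prints [[:user, 42], [:build, 7]]
```

With this shape, a follow-up request scoped to either the user or the build finds a stored write location to compare against the replicas.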