Fix NPE when scheduling Sidekiq job and no DB replicas available
What does this MR do and why?
We currently write the Write Ahead Location pointer (WAL) to a Sidekiq job hash when load balancing is enabled. This allows Sidekiq workers to make decisions based on how far replicas are lagging behind the primary.
We query the WAL in web workers in the Sidekiq client middleware. In the case where only reads had occurred through the current database connection, we obtain the WAL from a replica provided by the LB. However, in the case where no replica is available at all, this returns nil
, which the middleware never accounted for. This will crash the request.
We discussed various approaches to fix this but for now settled on the simplest one because there were open questions as to the semantics and implications of some of the "proper" fixes. See open questions below.
Here we do the simplest thing we could think of that should still provide reliable and correct behavior: fall back to the primary WAL in case no replicas are available.
This MR also removes the wal_location_source
log field introduced in !90252 (merged) since that was complicating this fix. The WAL source was only used for debugging and is also of questionable value now that we split the database into main and CI.
Alternative approaches
The drawback of this fix is that it is more of a band-aid; one could argue that it should be the LB who figures this out for us and just provides us with a proper WAL, whether obtained from the primary or not.
Such a solution was outlined here: #348510 (comment 1189653558)
def read_connection_write_location
# When only using the primary there's no need for any WAL queries.
return if primary_only?
if ::Gitlab::Database::LoadBalancing::Session.current.use_primary?
read_write do |connection|
get_write_location(connection)
end
else
read do |connection|
get_write_location(connection)
end
end
end
end
We found two problems with this:
- It changes semantics because we would execute a different query to obtain the WAL for a replica (compare
get_write_location
on LB withdatabase_replica_location
onHost
). We are not confident that this would be correct. - The
get_write_location
method contains a fairly dated environment toggle that switches the query, complicating this solution further. We are not sure whether this should still apply or not but would rather address that in a separate issue/MR.
Open questions
-
Is it OK to remove wal_location_source
? Nope, we added it back. -
Could we simplify this further by always reading primary_wal_location
? There is evidence that this is what we do anyway 90% of the time:
We decided to revisit this in a follow-up: #385255
Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
How to set up and validate locally
- When on
master
, enable load balancing and start sidekiq and puma - Shut down all read replicas
- Schedule a Sidekiq job that uses
data_consistency: :delayed
(I patchedChaos::DbSpinWorker
to test this) - It will crash with a NPE since the host will be
nil
Check out this branch, repeat, the problem should disappear.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #348510 (closed)