Fix NPE when scheduling Sidekiq job and no DB replicas available (!106184) · Merge requests · GitLab.org / GitLab

Matthias Käppler requested to merge 348510-fix-lb-nil-wal into master Dec 06, 2022

What does this MR do and why?

We currently write the Write Ahead Location pointer (WAL) to a Sidekiq job hash when load balancing is enabled. This allows Sidekiq workers to make decisions based on how far replicas are lagging behind the primary.

We query the WAL in web workers in the Sidekiq client middleware. In the case where only reads had occurred through the current database connection, we obtain the WAL from a replica provided by the LB. However, in the case where no replica is available at all, this returns nil, which the middleware never accounted for. This will crash the request.

We discussed various approaches to fix this but for now settled on the simplest one because there were open questions as to the semantics and implications of some of the "proper" fixes. See open questions below.

Here we do the simplest thing we could think of that should still provide reliable and correct behavior: fall back to the primary WAL in case no replicas are available.

This MR also removes the wal_location_source log field introduced in !90252 (merged) since that was complicating this fix. The WAL source was only used for debugging and is also of questionable value now that we split the database into main and CI.

Alternative approaches

The drawback of this fix is that it is more of a band-aid; one could argue that it should be the LB who figures this out for us and just provides us with a proper WAL, whether obtained from the primary or not.

Such a solution was outlined here: #348510 (comment 1189653558)

  def read_connection_write_location
     # When only using the primary there's no need for any WAL queries.
    return if primary_only?
    
    if ::Gitlab::Database::LoadBalancing::Session.current.use_primary?
      read_write do |connection|
        get_write_location(connection)
      end
    else
      read do |connection|
        get_write_location(connection)
      end
    end
  end
end

We found two problems with this:

It changes semantics because we would execute a different query to obtain the WAL for a replica (compare get_write_location on LB with database_replica_location on Host). We are not confident that this would be correct.
The get_write_location method contains a fairly dated environment toggle that switches the query, complicating this solution further. We are not sure whether this should still apply or not but would rather address that in a separate issue/MR.

Open questions

Is it OK to remove wal_location_source? Nope, we added it back.
Could we simplify this further by always reading primary_wal_location? There is evidence that this is what we do anyway 90% of the time:

We decided to revisit this in a follow-up: #385255

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

When on master, enable load balancing and start sidekiq and puma
Shut down all read replicas
Schedule a Sidekiq job that uses data_consistency: :delayed (I patched Chaos::DbSpinWorker to test this)
It will crash with a NPE since the host will be nil

Check out this branch, repeat, the problem should disappear.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Related to #348510 (closed)

Edited Dec 09, 2022 by Matthias Käppler

Fix NPE when scheduling Sidekiq job and no DB replicas available