Skip to content

Fix NPE when scheduling Sidekiq job and no DB replicas available

Matthias Käppler requested to merge 348510-fix-lb-nil-wal into master

What does this MR do and why?

We currently write the Write Ahead Location pointer (WAL) to a Sidekiq job hash when load balancing is enabled. This allows Sidekiq workers to make decisions based on how far replicas are lagging behind the primary.

We query the WAL in web workers in the Sidekiq client middleware. In the case where only reads had occurred through the current database connection, we obtain the WAL from a replica provided by the LB. However, in the case where no replica is available at all, this returns nil, which the middleware never accounted for. This will crash the request.

We discussed various approaches to fix this but for now settled on the simplest one because there were open questions as to the semantics and implications of some of the "proper" fixes. See open questions below.

Here we do the simplest thing we could think of that should still provide reliable and correct behavior: fall back to the primary WAL in case no replicas are available.

This MR also removes the wal_location_source log field introduced in !90252 (merged) since that was complicating this fix. The WAL source was only used for debugging and is also of questionable value now that we split the database into main and CI.

Alternative approaches

The drawback of this fix is that it is more of a band-aid; one could argue that it should be the LB who figures this out for us and just provides us with a proper WAL, whether obtained from the primary or not.

Such a solution was outlined here: #348510 (comment 1189653558)

  def read_connection_write_location
     # When only using the primary there's no need for any WAL queries.
    return if primary_only?
    
    if ::Gitlab::Database::LoadBalancing::Session.current.use_primary?
      read_write do |connection|
        get_write_location(connection)
      end
    else
      read do |connection|
        get_write_location(connection)
      end
    end
  end
end

We found two problems with this:

  1. It changes semantics because we would execute a different query to obtain the WAL for a replica (compare get_write_location on LB with database_replica_location on Host). We are not confident that this would be correct.
  2. The get_write_location method contains a fairly dated environment toggle that switches the query, complicating this solution further. We are not sure whether this should still apply or not but would rather address that in a separate issue/MR.

Open questions

  • Is it OK to remove wal_location_source? Nope, we added it back.
  • Could we simplify this further by always reading primary_wal_location? There is evidence that this is what we do anyway 90% of the time:

image__2_

We decided to revisit this in a follow-up: #385255

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

  1. When on master, enable load balancing and start sidekiq and puma
  2. Shut down all read replicas
  3. Schedule a Sidekiq job that uses data_consistency: :delayed (I patched Chaos::DbSpinWorker to test this)
  4. It will crash with a NPE since the host will be nil

Check out this branch, repeat, the problem should disappear.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #348510 (closed)

Edited by Matthias Käppler

Merge request reports