The `use_model_load_balancing` results in a wrong sticking context used (!73949) · Merge requests · GitLab.org / GitLab

What does this MR do and why?

Fixes bug introduced by !73631 (merged) in use_model_load_balancing.

When doing use_model_load_balancing a wrong sticking context is used. This solves a subtle bug with sticking when use_model_load_balancing is used.

The problem was that for Web/Sidekiq:

If prior request would have use_model_load_balancing=false it would see load balancers to be main/main. This would result in sticking context to include only main.
If next request would evaluate use_model_load_balancing=true it would see load balancers for main/ci, but would only for wal locations see main=lag. As a result we would not check replication lag against ci.
This result in a situation that if ci is way behind main, the 2. would not understand the state of ci. Thus would consider ci to be up-to date and stick to main.

This commit fixes that:

If ci: is configured we always store the ci: in replication lag. This makes us the wal_locations (and RackMiddleware sticking context) to always include main+ci in all cases.
This results would behave (as for number of requests): we are reading primary / replica location from all databases.
Since we always have ci even if use_model_load_balancing=false we can properly evaluate state of replicas.

Related to:

Test

For a case when ci-replica has significantly more lag than main:

development:
  main:
    host: postgres
    load_balancing:
      hosts:
      - postgres
      - postgres
  ci:
    host: postgres
    load_balancing:
      hosts:
      - postgres-replica
      - postgres-replica


def current_wals
  ::Gitlab::Database::LoadBalancing.each_load_balancer.to_h do |lb|
    [lb.name, lb.primary_write_location]
  end
end

def use_replica?(wals)
  ::Gitlab::Database::LoadBalancing.each_load_balancer.all? do |lb|
    if location = wals[lb.name]
      lb.select_up_to_date_host(location)
    else
      true
    end
  end
end

def run_test
  Feature.disable(:use_model_load_balancing)
  Gitlab::SafeRequestStore.begin!
  Gitlab::SafeRequestStore.clear!
  wals = current_wals
  User.last.update(note: rand())

  Feature.enable(:use_model_load_balancing)
  Gitlab::SafeRequestStore.begin!
  Gitlab::SafeRequestStore.clear!
  pp(wals: wals, latest_wals: current_wals, use_replica: use_replica?(wals))
end

# Without this MR
[64] pry(main)> run_test
=> {:wals=>{:main=>"1/8700B170"}, :latest_wals=>{:main=>"1/8700B3C0", :ci=>"1/8700B3C0"}, :use_replica=>true}

# With this MR
[11] pry(main)> run_test
=> {:wals=>{:main=>"1/87008E78", :ci=>"1/87008E78"}, :latest_wals=>{:main=>"1/8700B0D8", :ci=>"1/8700B0D8"}, :use_replica=>false}

It is expected that see use_replica=false in all cases.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Nov 08, 2021 by Kamil Trzciński

The `use_model_load_balancing` results in a wrong sticking context used

What does this MR do and why?

Test

MR acceptance checklist

Merge request reports