Log connection errors with database load balancing, instead of returning nil

Context

While working with a customer (Internal ZD ticket) having issues with database load balancing, the logs contained errors like this:

"event":"no_secondaries_available","message":"No secondaries were available, using primary instead"

and

"event":"host_offline","message":"Host is offline after replica status check",

As part of troubleshooting, we tried to work out why the hosts were being marked as offline. Digging into the code, I found that replica_is_up_to_date? was the area of interest.

A node is considered online if: replication_lag_below_threshold? is true, OR data_is_recent_enough? is true.
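To make the rule above concrete, here is a minimal sketch of the check. It is not the actual code in host.rb: the SketchHost class, its constructor arguments, and the threshold values are assumptions made up for illustration. It only shows that when both lag measurements come back as nil (for example because a connection error was swallowed), the host ends up looking offline with no explanation.

```ruby
# Hypothetical stand-in for a load-balancing host; NOT GitLab's real class.
# The thresholds below are illustrative assumptions, not GitLab defaults.
class SketchHost
  def initialize(lag_seconds:, lag_bytes:, max_lag_seconds: 60, max_lag_bytes: 8 * 1024 * 1024)
    @lag_seconds = lag_seconds       # nil models "could not be measured"
    @lag_bytes = lag_bytes           # nil models "could not be measured"
    @max_lag_seconds = max_lag_seconds
    @max_lag_bytes = max_lag_bytes
  end

  # True when the time-based lag is known and within the limit.
  def replication_lag_below_threshold?
    !@lag_seconds.nil? && @lag_seconds <= @max_lag_seconds
  end

  # True when the byte-based lag is known and within the limit.
  def data_is_recent_enough?
    !@lag_bytes.nil? && @lag_bytes <= @max_lag_bytes
  end

  # The host counts as online if either check passes.
  def replica_is_up_to_date?
    replication_lag_below_threshold? || data_is_recent_enough?
  end
end

healthy = SketchHost.new(lag_seconds: 5, lag_bytes: 1024)
slow_but_close = SketchHost.new(lag_seconds: 600, lag_bytes: 1024)
unreachable = SketchHost.new(lag_seconds: nil, lag_bytes: nil)

puts healthy.replica_is_up_to_date?
puts slow_but_close.replica_is_up_to_date?
puts unreachable.replica_is_up_to_date?
```

In this sketch the unreachable host is indistinguishable from one that is merely lagging, which is the situation we ran into.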

The data_is_recent_enough? method checks whether the difference in bytes between the primary and the replica is below a certain threshold, using the replication_lag_size method:

def replication_lag_size
  location = connection.quote(primary_write_location)
  row = query_and_release(<<-SQL.squish)
            SELECT pg_wal_lsn_diff(#{location}, pg_last_wal_replay_lsn())::float
              AS diff
  SQL

  row['diff'].to_i if row.any?
rescue *CONNECTION_ERRORS
  nil
end

If the query raises one of the CONNECTION_ERRORS, the method rescues it and returns nil, so there is no indication to the user that anything is wrong.

In our case, the customer had to modify the codebase temporarily to actually show the error message we were interested in:

irb(main):001:0> ActiveRecord::Base.connection.load_balancer.host_list.hosts.map {|host| host.replication_lag_size}
Traceback (most recent call last):
        5: from (irb):1
        4: from (irb):1:in `map'
        3: from (irb):1:in `block in irb_binding'
        2: from lib/gitlab/database/load_balancing/host.rb:142:in `replication_lag_size'
        1: from lib/gitlab/database/load_balancing/host.rb:10:in `connection'
ActiveRecord::ConnectionNotEstablished (server does not support SSL, but SSL was required)

From the above, we can see that there was an SSL issue that needed to be investigated.

Proposal

Instead of silently returning nil on a connection error, we should at least log the error to the database_load_balancing.log logfile to assist with troubleshooting.
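As a rough illustration of the proposal, the rescue branch could log the exception before returning nil. The sketch below is self-contained and does not use GitLab's real logger or error list: FakeConnectionError, CONNECTION_ERRORS_SKETCH, LoggingHostSketch, and the log message format are all hypothetical names chosen for this example, and a plain Ruby Logger writing to a StringIO stands in for database_load_balancing.log.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-ins for the real error classes in CONNECTION_ERRORS.
class FakeConnectionError < StandardError; end
CONNECTION_ERRORS_SKETCH = [FakeConnectionError].freeze

# Hypothetical host wrapper; the block passed to new simulates the SQL query.
class LoggingHostSketch
  def initialize(logger:, host:, &lag_query)
    @logger = logger
    @host = host
    @lag_query = lag_query
  end

  def replication_lag_size
    @lag_query.call
  rescue *CONNECTION_ERRORS_SKETCH => e
    # Record what went wrong before falling back to nil, as before.
    @logger.error("event=connection_error host=#{@host} error_class=#{e.class} message=#{e.message}")
    nil
  end
end

io = StringIO.new
host = LoggingHostSketch.new(logger: Logger.new(io), host: 'replica-1') do
  raise FakeConnectionError, 'server does not support SSL, but SSL was required'
end

result = host.replication_lag_size
puts result.inspect
puts io.string
```

The return value stays nil, so callers behave as they do today, but the SSL error from the traceback above would now show up in the log instead of requiring a temporary code change.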

The 3 places in lib/gitlab/database/load_balancing/host.rb are:
