Log connection errors with database load balancing, instead of returning nil
Context
While working with a customer (Internal ZD ticket) having issues with database load balancing, the logs contained errors like this:
"event":"no_secondaries_available","message":"No secondaries were available, using primary instead"
AND
"event":"host_offline","message":"Host is offline after replica status check",
As part of troubleshooting, we tried to work out why the hosts were being marked as offline. Digging into the code, I found that replica_is_up_to_date? was the area of interest.
A node is considered online if:
replication_lag_below_threshold? is true, OR
data_is_recent_enough? is true.
The data_is_recent_enough? method checks whether the difference in bytes between the primary and the replica is below a certain threshold, using the replication_lag_size method:
def replication_lag_size
  location = connection.quote(primary_write_location)
  row = query_and_release(<<-SQL.squish)
    SELECT pg_wal_lsn_diff(#{location}, pg_last_wal_replay_lsn())::float
    AS diff
  SQL

  row['diff'].to_i if row.any?
rescue *CONNECTION_ERRORS
  nil
end
If the query raises one of the CONNECTION_ERRORS, the method rescues it and returns nil, so there is no indication to the user that anything is wrong.
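To illustrate why this is hard to troubleshoot, here is a minimal standalone sketch of the online check described above. It is not the actual GitLab code: SketchHost, MAX_LAG_BYTES and the way nil is handled in data_is_recent_enough? are assumptions made for illustration. It only shows that a replica that is genuinely lagging and a replica that cannot be reached at all end up with the same result.

```ruby
# Hypothetical stand-in for a load-balancing host; not GitLab's actual class.
class SketchHost
  MAX_LAG_BYTES = 8 * 1024 * 1024 # assumed threshold, for illustration only

  def initialize(lag_size:, lag_time_ok:)
    @lag_size = lag_size       # Integer, or nil when the lag query failed
    @lag_time_ok = lag_time_ok # result of the time-based lag check
  end

  def replication_lag_below_threshold?
    @lag_time_ok
  end

  def data_is_recent_enough?
    # Assumption: a nil size (connection error) is treated like "too far behind".
    !@lag_size.nil? && @lag_size < MAX_LAG_BYTES
  end

  # Online if either check passes, as described in the issue.
  def replica_is_up_to_date?
    replication_lag_below_threshold? || data_is_recent_enough?
  end
end

lagging     = SketchHost.new(lag_size: 64 * 1024 * 1024, lag_time_ok: false)
unreachable = SketchHost.new(lag_size: nil, lag_time_ok: false)

puts lagging.replica_is_up_to_date?     # false
puts unreachable.replica_is_up_to_date? # false
```

Both hosts would be reported as offline, and nothing in the result distinguishes replication lag from a connection failure.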
In our case, the customer had to modify the codebase temporarily to actually show the error message we were interested in:
irb(main):001:0> ActiveRecord::Base.connection.load_balancer.host_list.hosts.map {|host| host.replication_lag_size}
Traceback (most recent call last):
5: from (irb):1
4: from (irb):1:in `map'
3: from (irb):1:in `block in irb_binding'
2: from lib/gitlab/database/load_balancing/host.rb:142:in `replication_lag_size'
1: from lib/gitlab/database/load_balancing/host.rb:10:in `connection'
ActiveRecord::ConnectionNotEstablished (server does not support SSL, but SSL was required)
From the above, we can see that there was an SSL issue that needed to be investigated.
Proposal
Instead of silently returning nil on a connection error, we should at least log the error to the database_load_balancing.log log file to assist with troubleshooting.
The 3 places in lib/gitlab/database/load_balancing/host.rb are:
- replication_lag_size (mentioned above)
- database_replica_location
- caught_up? (this returns false, but the same idea applies: log that there was a connection error)
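A rough sketch of what the proposal could look like, written as a standalone script rather than a patch to host.rb. Everything named Sketch*, the log_connection_error helper, the 'connection_error' event name and the use of Ruby's standard Logger are assumptions for illustration. The real change would use the existing load balancing logger and the real CONNECTION_ERRORS list.

```ruby
require 'json'
require 'logger'
require 'stringio'

# Hypothetical error class and list standing in for the real CONNECTION_ERRORS.
class SketchConnectionError < StandardError; end
SKETCH_CONNECTION_ERRORS = [SketchConnectionError].freeze

# Hypothetical host; the block passed to new stands in for query_and_release.
class SketchLoggingHost
  def initialize(logger:, host:, &query)
    @logger = logger
    @host = host
    @query = query
  end

  def replication_lag_size
    row = @query.call
    row['diff'].to_i if row.any?
  rescue *SKETCH_CONNECTION_ERRORS => e
    # Still return nil so callers behave as before, but record why.
    log_connection_error(:replication_lag_size, e)
    nil
  end

  private

  # Hypothetical helper; the event name is made up for this sketch.
  def log_connection_error(method_name, error)
    payload = {
      event: 'connection_error',
      method: method_name,
      db_host: @host,
      error_class: error.class.name,
      message: error.message
    }
    @logger.warn(JSON.generate(payload))
  end
end

log_output = StringIO.new
failing_host = SketchLoggingHost.new(logger: Logger.new(log_output), host: 'replica-1.example') do
  raise SketchConnectionError, 'server does not support SSL, but SSL was required'
end

lag = failing_host.replication_lag_size
puts lag.inspect       # nil, same return value as today
puts log_output.string # now contains the underlying SSL error
```

The return value is unchanged, so the load balancer would still fall back in the same way. The only difference is that the underlying error becomes visible in the log without modifying the codebase.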