Fresh eyes code review of ee/lib/gitlab/database/load_balancing
I'm looking for a fresh eyes review of the ee/lib/gitlab/database/load_balancing
code directory. We had an incident this morning that may be related to a bug in this part of the code base. I'd like to do an independent audit on this since it feels like a worthy experiment.
@timzallmann @dhavens @tstadelhofer @cdu1 @bmarnane can you determine who might be able to participate and turn this around in the next couple of days? It's approximately 1,000 lines across the files in this directory.
What I'm looking for:
- Possible bugs found
- Questions around how the code base works
- Any suggestions related to observability we think are worthwhile
This will also give other team members some visibility into a part of the code base they haven't worked with before, and a chance to learn something. Multiple people can review the code, so I would encourage more than one participant.
Summary of possible followups [2019-08-15 Thong]
- https://gitlab.com/gitlab-org/gitlab-ee/issues/13547: Change logging to use structured logging. We could then chart, on each application server, which host it thinks it has taken offline and which host is actually offline. (A structured logging sketch follows this list.)
- https://gitlab.com/gitlab-org/gitlab-ee/issues/13548: We should log when we need to try a different host due to query conflicts (https://gitlab.com/gitlab-org/gitlab-ee/blob/4d73f53e7f5c5fdd1989ac2c71d51a6d40dbe3f0/ee/lib/gitlab/database/load_balancing/load_balancer.rb#L42-43). (A sketch of where this log line could go follows this list.)
- https://gitlab.com/gitlab-org/gitlab-ee/issues/13549: `LoadBalancing.active_record_models` seems unused (last usage removed in 96685fc1). Consider removing it.
- Question: Do we need to connect the pool again once a host comes back online? (`@pool.disconnect!` is called when a host is taken offline.) See the pool sketch below.
- Question: In `#online?`, it's a bit misleading to log that a host "came back online" if it never went offline. Perhaps we can change this? See the transition sketch below.
- Investigate: I haven't found any obvious bugs that would cause one host to bear most of the read load (corroborated by gitlab-com/gl-infra/production#1054 (comment 204628993)). Perhaps we need to investigate what's going on beyond the application. I think not being able to handle a backlog of slow queries is a contributing factor.
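For the structured logging followup, a minimal sketch of what this could look like, assuming a JSON logger along the lines of `Gitlab::JsonLogger`. The `LoadBalancingLogger` class name, log file name, and the `event`/`db_host`/`db_port` fields are illustrative, not a final schema:

```ruby
# Hypothetical structured logger for load balancing events.
class LoadBalancingLogger < Gitlab::JsonLogger
  def self.file_name_noext
    'database_load_balancing'
  end
end

# Instead of an interpolated free-text message, emit fields that can be
# charted per application server:
LoadBalancingLogger.warn(
  event: 'host_offline',
  db_host: host.host,
  db_port: host.port
)
```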
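For the query conflict followup, a sketch of where the log line could sit in `LoadBalancer#read`. The surrounding retry bookkeeping only approximates the real method, and the `LoadBalancing.log` call and message wording are assumptions:

```ruby
# Sketch of LoadBalancer#read with the suggested log line added; the
# helpers (host, serialization_failure?, release_host, read_write) are
# the existing LoadBalancer methods.
def read(&block)
  conflict_retried = 0

  while host
    begin
      return yield host.connection
    rescue => error
      raise error unless serialization_failure?(error)

      # A query conflict on a hot standby: log it before cycling to the
      # next secondary so we can see how often this happens.
      LoadBalancing.log(
        :warn,
        "Retrying on another host due to a query conflict (attempt #{conflict_retried + 1})"
      )
      conflict_retried += 1
      release_host
    end
  end

  # All secondaries exhausted; fall back to the primary.
  read_write(&block)
end
```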
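On the pool question: my understanding is that ActiveRecord's connection pool re-establishes connections lazily, so after `disconnect!` the next checkout opens a fresh connection and no explicit reconnect should be needed once the host is reachable again. A minimal sketch of that assumption:

```ruby
require 'active_record'

pool = ActiveRecord::Base.connection_pool

# Closes all connections currently held by the pool.
pool.disconnect!

# The pool establishes a new connection lazily on the next checkout, so
# this works again once the host is back online.
pool.with_connection do |connection|
  connection.execute('SELECT 1')
end
```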
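And for the `#online?` question, a minimal sketch of logging only on an actual offline-to-online transition. The method shape loosely follows `Host#online?`; the real method may differ:

```ruby
# Sketch of Host#online? that only logs when the host transitions from
# offline to online, instead of on every status refresh.
def online?
  return @online unless check_replica_status?

  was_offline = !@online
  refresh_status

  if @online && was_offline
    LoadBalancing.log(:info, 'Host came back online')
  end

  @online
rescue *CONNECTION_ERRORS
  offline!
  false
end
```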
Followups
~"group::autodevops and kubernetes" will own the following improvements to logging: