Fresh eyes code review of ee/lib/gitlab/database/load_balancing
I'm looking for a fresh eyes review of the ee/lib/gitlab/database/load_balancing
code directory. We had an incident this morning that may be related to a bug in this part of the code base. I'd like to do an independent audit on this since it feels like a worthy experiment.
@timzallmann @dhavens @tstadelhofer @cdu1 @bmarnane can you determine who might be able to participate and turn this around in the next couple of days? It's approximately 1,000 lines across the files in this directory.
What I'm looking for:
- Possible bugs found
- Questions around how the code base works
- Any suggestions related to observability we think are worthwhile
This will also give other team members some visibility into a part of the code base they haven't worked with before, and a chance to learn something. Multiple people can review the code, so I would encourage more than one participant.
Summary of possible followups [2019-08-15 Thong]
- https://gitlab.com/gitlab-org/gitlab-ee/issues/13547: Change logging to use structured logging. We could then chart, on each application server, which host it thinks it has taken offline and which host is actually offline. (A structured logging sketch follows this list.)
- https://gitlab.com/gitlab-org/gitlab-ee/issues/13548: We should log when we need to try a different host due to query conflicts (https://gitlab.com/gitlab-org/gitlab-ee/blob/4d73f53e7f5c5fdd1989ac2c71d51a6d40dbe3f0/ee/lib/gitlab/database/load_balancing/load_balancer.rb#L42-43). (A sketch of where this log line could go follows this list.)
- https://gitlab.com/gitlab-org/gitlab-ee/issues/13549: `LoadBalancing.active_record_models` seems unused (last usage removed in 96685fc1). Consider removing it.
- Question: Do we need to connect the pool again once a host comes back online? (`@pool.disconnect!` is called when a host is taken offline.) See the pool sketch below.
- Question: In `#online?`, it's a bit misleading to log that a host "came back online" if it never went offline. Perhaps we can change this? See the transition sketch below.
- Investigate: I haven't found any obvious bugs that would cause one host to bear most of the read load (corroborated by gitlab-com/gl-infra/production#1054 (comment 204628993)). Perhaps we need to investigate what's going on beyond the application. I think not being able to handle a backlog of slow queries is a contributing factor.
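For the structured logging followup, a minimal sketch of what this could look like, assuming a JSON logger along the lines of `Gitlab::JsonLogger`. The `LoadBalancingLogger` class name, log file name, and the `event`/`db_host`/`db_port` fields are illustrative, not a final schema:

```ruby
# Hypothetical structured logger for load balancing events.
class LoadBalancingLogger < Gitlab::JsonLogger
  def self.file_name_noext
    'database_load_balancing'
  end
end

# Instead of an interpolated free-text message, emit fields that can be
# charted per application server:
LoadBalancingLogger.warn(
  event: 'host_offline',
  db_host: host.host,
  db_port: host.port
)
```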
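For the query conflict followup, a sketch of where the log line could sit in `LoadBalancer#read`. The surrounding retry bookkeeping only approximates the real method, and the `LoadBalancing.log` call and message wording are assumptions:

```ruby
# Sketch of LoadBalancer#read with the suggested log line added; the
# helpers (host, serialization_failure?, release_host, read_write) are
# the existing LoadBalancer methods.
def read(&block)
  conflict_retried = 0

  while host
    begin
      return yield host.connection
    rescue => error
      raise error unless serialization_failure?(error)

      # A query conflict on a hot standby: log it before cycling to the
      # next secondary so we can see how often this happens.
      LoadBalancing.log(
        :warn,
        "Retrying on another host due to a query conflict (attempt #{conflict_retried + 1})"
      )
      conflict_retried += 1
      release_host
    end
  end

  # All secondaries exhausted; fall back to the primary.
  read_write(&block)
end
```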
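On the pool question: my understanding is that ActiveRecord's connection pool re-establishes connections lazily, so after `disconnect!` the next checkout opens a fresh connection and no explicit reconnect should be needed once the host is reachable again. A minimal sketch of that assumption:

```ruby
require 'active_record'

pool = ActiveRecord::Base.connection_pool

# Closes all connections currently held by the pool.
pool.disconnect!

# The pool establishes a new connection lazily on the next checkout, so
# this works again once the host is back online.
pool.with_connection do |connection|
  connection.execute('SELECT 1')
end
```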
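And for the `#online?` question, a minimal sketch of logging only on an actual offline-to-online transition. The method shape loosely follows `Host#online?`; the real method may differ:

```ruby
# Sketch of Host#online? that only logs when the host transitions from
# offline to online, instead of on every status refresh.
def online?
  return @online unless check_replica_status?

  was_offline = !@online
  refresh_status

  if @online && was_offline
    LoadBalancing.log(:info, 'Host came back online')
  end

  @online
rescue *CONNECTION_ERRORS
  offline!
  false
end
```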
Followups
~"group::autodevops and kubernetes" will own the following improvements to logging: