
Differentiate metrics and logs from replica/primary databases

What does this MR do?

Closes #323164 (closed) Closes #323165 (closed)

In summary, our instrumentation infrastructure unifies all database accesses under the same metrics. In many cases, especially in some recent incidents, it would be extremely useful to know the source of each query, the replica utilization, and the time spent in each database during a particular request. In fact, in my local development environment, I found that some read-only, safely cacheable queries are executed on the primary database. It is therefore useful to have this information available in the performance bar as well. This MR differentiates the metrics and logs by database role (replica/primary). In detail:

  • Introduce db_<role>_count, db_<role>_cached_count and db_<role>_duration_s in web logs and Sidekiq structured logs
  • Introduce gitlab_transaction_db_<role>_count_total Prometheus counter
  • Introduce gitlab_sql_<role>_duration_seconds Prometheus histogram
  • Add replica and primary tags to the Active Record section of the performance bar

All of these features are enabled only if database load balancing is enabled. If it is not, no further information is added. This means the change barely affects self-managed instances.
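
The naming scheme above can be illustrated with a minimal sketch. The `ROLES` constant, the `role_metric_names` and `metric_names` helpers, and the `load_balancing_enabled:` flag are hypothetical and not taken from this MR; only the key and metric name patterns come from the list above.

```ruby
# Hypothetical helper: expands the <role> placeholder used in this MR
# into concrete log keys and Prometheus metric names.
ROLES = %i[primary replica].freeze

def role_metric_names(role)
  raise ArgumentError, "unknown role: #{role}" unless ROLES.include?(role)

  {
    log_keys: ["db_#{role}_count", "db_#{role}_cached_count", "db_#{role}_duration_s"],
    counter: "gitlab_transaction_db_#{role}_count_total",
    histogram: "gitlab_sql_#{role}_duration_seconds"
  }
end

# Assumed flag: role-specific names exist only when load balancing is enabled.
def metric_names(load_balancing_enabled:)
  return [] unless load_balancing_enabled

  ROLES.map { |role| role_metric_names(role) }
end
```

With load balancing disabled, `metric_names` returns an empty list, which mirrors the statement that no further information is added in that case.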

Solution

Extracted from gitlab-com/gl-infra/scalability#873 (comment 511091812)

When load balancing is enabled, ActiveRecord::Base is patched so that ActiveRecord::Base#connection returns a Gitlab::Database::LoadBalancing::ConnectionProxy instance that wraps a PostgreSQLAdapter. This proxy redirects read and write statements to the corresponding connection (primary/replica). Luckily, ConnectionProxy patches only high-level statements and falls back to the original ActiveRecord connection for all other method calls. This means that the connection field in the instrumentation event payload is guaranteed to be the connection after replica redirection.

The full pipeline looks like this:

User.find(1)
-> call ActiveRecord::Base.connection 
   -> return an instance of Gitlab::Database::LoadBalancing::ConnectionProxy
-> call Gitlab::Database::LoadBalancing::ConnectionProxy#select
   -> call Gitlab::Database::LoadBalancing::LoadBalancer#read or #write
      -> return an ActiveRecord::ConnectionAdapters::PostgreSQLAdapter object from primary or replica host.
   -> call ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#select
   -> call ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#exec_no_cache and friends
      -> Broadcast the instrumentation event
      -> Listeners capture and accumulate the events

If an accessor is not covered by the proxy, for example ActiveRecord::Base.connection.query('select 1'), the proxy falls back to the connection object and the flow stays the same.
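
The routing and fallback behaviour described above can be sketched in plain Ruby. This is a simplified illustration, not the actual ConnectionProxy code: `FakeConnection`, `FakeLoadBalancer`, and `SketchConnectionProxy` are hypothetical stand-ins, and sending uncovered methods to the primary connection is an assumption of this sketch.

```ruby
# Stand-in for a raw database connection; records the statements it ran.
class FakeConnection
  attr_reader :host, :statements

  def initialize(host)
    @host = host
    @statements = []
  end

  def select(sql)
    @statements << sql
    "#{host}: #{sql}"
  end

  def query(sql)
    @statements << sql
    "#{host}: #{sql}"
  end
end

# Stand-in for the load balancer: yields a replica for reads, the primary for writes.
class FakeLoadBalancer
  def initialize(primary:, replica:)
    @primary = primary
    @replica = replica
  end

  def read
    yield @replica
  end

  def write
    yield @primary
  end
end

# Simplified proxy: `select` is routed explicitly to a replica; every other
# method falls back to the underlying connection through method_missing.
class SketchConnectionProxy
  def initialize(load_balancer)
    @load_balancer = load_balancer
  end

  def select(sql)
    @load_balancer.read { |connection| connection.select(sql) }
  end

  def method_missing(name, *args, &block)
    # Assumption: uncovered methods go to the primary connection.
    connection = @load_balancer.write { |conn| conn }
    return super unless connection.respond_to?(name)

    connection.public_send(name, *args, &block)
  end

  def respond_to_missing?(name, include_private = false)
    @load_balancer.write { |conn| conn.respond_to?(name, include_private) } || super
  end
end
```

In this sketch, calling `select` on the proxy reaches the replica stand-in, while `query`, which the proxy does not define, reaches the primary stand-in unchanged. Either way the statement is executed by a raw connection object, which is the property the classification relies on.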

Once the broadcast connection objects are known to be the raw connection objects, we can classify the events with confidence. In detail, most of the work is in modifying Gitlab::Metrics::Subscribers::ActiveRecord:

  • Store the role (primary/replica) of each connection after it is retrieved from the load balancer.
  • When listening to the sql.active_record event, the metrics subscriber calls the global Gitlab::Database::LoadBalancing#db_role method to classify the received connection.
  • Broadcast metrics and accumulate logs for use in lograge and structured logs.
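
The three steps above can be sketched without Rails as follows. `RoleRegistry` and `SketchSubscriber` are hypothetical names, and the `sql(payload, duration_s)` signature is a simplification of a real ActiveSupport subscriber, which receives an event object. The sketch assumes the payload carries `:connection` and `:cached` keys, and that cached queries are counted in both the total and the cached counter.

```ruby
# Hypothetical registry: remembers the role of each connection by object identity.
class RoleRegistry
  def initialize
    @roles = {}.compare_by_identity
  end

  # Called right after a connection is retrieved from the load balancer.
  def register(connection, role)
    @roles[connection] = role
  end

  def role_for(connection)
    @roles[connection]
  end
end

# Hypothetical subscriber: accumulates per-role counters for the log output.
class SketchSubscriber
  attr_reader :counters

  def initialize(registry)
    @registry = registry
    @counters = Hash.new(0)
  end

  # `payload` mimics a sql.active_record payload.
  def sql(payload, duration_s)
    role = @registry.role_for(payload[:connection])
    return if role.nil? # unknown connection, e.g. load balancing disabled

    @counters["db_#{role}_count"] += 1
    @counters["db_#{role}_cached_count"] += 1 if payload[:cached]
    @counters["db_#{role}_duration_s"] += duration_s
  end
end
```

Keying the registry by object identity avoids depending on how a connection adapter implements equality or hashing. Events whose connection was never registered are ignored, so nothing is recorded when no role is known.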

Screenshots (strongly suggested)

New log items in Web JSON logs

Screen_Shot_2021-02-23_at_11.16.37

New log items in API JSON logs

Screen_Shot_2021-02-23_at_11.18.52

New tags in the Active Record section in the performance bar

Screen_Shot_2021-02-23_at_11.17.12

New Prometheus metrics

Screen_Shot_2021-02-23_at_11.17.47

Screen_Shot_2021-02-23_at_11.18.11

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by Andrew Newdigate
