Skip to content

Monitor database transaction activity for Rails

Andrew Newdigate requested to merge transaction-metrics into master

This change adds metrics for database transactions and Gitaly n+1 blocks.

Requires https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/29903

What does this MR do?

This changes comes out of gitlab-com/gl-infra/production#895 (closed), the production slowdown we've been seeing over the past few weeks (possibly months).

One of the strongest correlations with the slowdown was the number of database connections help open in the state idle in transaction. This state occurs when the client opens a transaction block and then performs multiple actions, some of which may not involve the database system.

The problem with this approach is that it is extremely fragile.

In a high volume system, it is critical that these transaction blocks are cleared as quickly as possible. The longer they are held open, the more impact they will have on scaling. This is for two reasons:

  1. Locks on the database. The impact of this is seen most on the internal_ids table, but this contention can happen elsewhere too
  2. Contention for a postgres connection. Pgbouncer, our connection pooler, is responsible for multiplexing many client connections between a small number of server connections. During a database transaction, the client holds the connection open for an extended period of time, during which time, other clients cannot use that connection, even if the client is holding it open for non-database connection reasons.
    • In the worse case scenario, this could lead to complete deadlock, if the service indirectly ends up calling other services which require a database connection, but cannot obtain one due to lack of connections.

This MR adds some new metrics, which are important for understanding the way GitLab scales.

The hope is that this will allow us to target poorly scaling code. Often, a small rearrangement, moving the non-transaction services calls out of the transactional block, is sufficient to resolve the issue.

In future, it would be ideal if attempting to call a Gitaly service from within a transaction results in an exception being thrown (in a similar manner to the Gitaly n+1 detector).

This approach has been attempted in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/29774 but it turns out that this anti-pattern is far more prevalent in the code than I had initially expected. So, as a first step, I plan to monitor and assess.

The new metrics added are:

  • gitlab_database_transaction_seconds_bucket A histogram measuring how much time each controller/action pair spends in transaction blocks.
  • gitaly_n_plus_one_duration_seconds_bucket This helps us understand which GitalyClient.allow_n_plus_one blocks are really hurting our server performance, and additionally notifies us if these n+1 blocks are running within a database transaction (which can be very detrimental)
  • gitaly_n_plus_one_calls_total This helps us understand which n+1s are actually generating the most traffic so that they can be prioritised.

image

Does this MR meet the acceptance criteria?

Conformity

Performance and testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by Andrew Newdigate

Merge request reports