Monitor database transaction activity for Rails
This change adds metrics for database transactions and Gitaly n+1 blocks.
Requires https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/29903
What does this MR do?
This changes comes out of gitlab-com/gl-infra/production#895 (closed), the production slowdown we've been seeing over the past few weeks (possibly months).
One of the strongest correlations with the slowdown was the number of database connections help open in the state idle in transaction
. This state occurs when the client opens a transaction block and then performs multiple actions, some of which may not involve the database system.
The problem with this approach is that it is extremely fragile.
In a high volume system, it is critical that these transaction blocks are cleared as quickly as possible. The longer they are held open, the more impact they will have on scaling. This is for two reasons:
- Locks on the database. The impact of this is seen most on the
internal_ids
table, but this contention can happen elsewhere too - Contention for a postgres connection. Pgbouncer, our connection pooler, is responsible for multiplexing many client connections between a small number of server connections. During a database transaction, the client holds the connection open for an extended period of time, during which time, other clients cannot use that connection, even if the client is holding it open for non-database connection reasons.
- In the worse case scenario, this could lead to complete deadlock, if the service indirectly ends up calling other services which require a database connection, but cannot obtain one due to lack of connections.
This MR adds some new metrics, which are important for understanding the way GitLab scales.
The hope is that this will allow us to target poorly scaling code. Often, a small rearrangement, moving the non-transaction services calls out of the transactional block, is sufficient to resolve the issue.
In future, it would be ideal if attempting to call a Gitaly service from within a transaction results in an exception being thrown (in a similar manner to the Gitaly n+1 detector).
This approach has been attempted in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/29774 but it turns out that this anti-pattern is far more prevalent in the code than I had initially expected. So, as a first step, I plan to monitor and assess.
The new metrics added are:
-
gitlab_database_transaction_seconds_bucket
A histogram measuring how much time each controller/action pair spends in transaction blocks. -
gitaly_n_plus_one_duration_seconds_bucket
This helps us understand whichGitalyClient.allow_n_plus_one
blocks are really hurting our server performance, and additionally notifies us if thesen+1
blocks are running within a database transaction (which can be very detrimental) -
gitaly_n_plus_one_calls_total
This helps us understand whichn+1s
are actually generating the most traffic so that they can be prioritised.
Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry -
Documentation created/updated or follow-up review issue created -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Performance and testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. -
Tested in all supported browsers
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team
Merge request reports
Activity
mentioned in merge request !29903 (merged)
1 Message 📖 This merge request adds or changes files that require a review from the Database team. Database Review
The following files require a review from the Database team:
lib/gitlab/database.rb
To make sure these changes are reviewed, take the following steps:
- Edit your merge request, and add
gl-database
to the list of Group approvers. - Mention
@gl-database
in a separate comment, and explain what needs to be reviewed by the team. Please don't mention the team until your changes are ready for review.
Reviewer roulette
Changes that require review have been detected! A merge request is normally reviewed by both a reviewer and a maintainer in its primary category (e.g. frontend or backend), and by a maintainer in all other categories.
To spread load more evenly across eligible reviewers, Danger has randomly picked a candidate for each review slot. Feel free to override this selection if you think someone else would be better-suited, or the chosen person is unavailable.
Once you've decided who will review this merge request, mention them as you normally would! Danger does not (yet?) automatically notify them for you.
Category Reviewer Maintainer backend Imre Farkas ( @ifarkas
)Jen-Shin Lin ( @godfat
)Generated by
🚫 DangerEdited by 🤖 GitLab Bot 🤖added 1 commit
- 4074907e - Adds metrics to measure cost of expensive operations.
mentioned in issue gitlab-com/gl-infra/production#895 (closed)
added 1 commit
- 250867b0 - Adds metrics to measure cost of expensive operations.
added 1 commit
- dd9778c7 - Adds metrics to measure cost of expensive operations
@abrandl turns out that disabling all gitaly calls inside transactions is going to be difficult.
Even just disabling known
n+1
s from being executed inside database transactions is going to be difficult as the practice is very prevalent, unfortunately.I'd be interested in getting your take on this as a first step, allow us to assess which controller/actions are the worst offenders, so that we can prioritise their remediation.
- Resolved by Andrew Newdigate
- Resolved by Andrew Newdigate
I'd be interested in getting your take on this as a first step, allow us to assess which controller/actions are the worst offenders, so that we can prioritise their remediation.
@andrewn Capturing the metrics is a great step, thank you!
mentioned in issue gitlab-com/gl-infra/production#915 (closed)
added 1 commit
- f42f66c2 - Adds metrics to measure cost of expensive operations
added Observability backstage [DEPRECATED] database + 1 deleted label
@abrandl I've made your suggested changes. Over to you for first pass on review
😄 - Resolved by Andrew Newdigate
added 1 commit
- cd59a1c8 - Adds metrics to measure cost of expensive operations
- Resolved by Andrew Newdigate
- Resolved by Andreas Brandl
Thanks @andrewn, LGTM for database, I added minor comments. Please pass it on to a backend maintainer for further review and merge.
@godfat over to you for the next round of review.
added 473 commits
-
41197c06...32806aee - 472 commits from branch
master
- 9c7ca72d - Adds metrics to measure cost of expensive operations
-
41197c06...32806aee - 472 commits from branch
added 1 commit
- 2b086241 - Adds metrics to measure database transactions
mentioned in merge request !30022 (closed)
changed milestone to %12.1
- Resolved by Lin Jen-Shin
I'll review this tomorrow, and I hope that's fast enough :)
- Resolved by Lin Jen-Shin
- Resolved by Lin Jen-Shin
added 1 commit
- 56ae34e4 - Adds metrics to measure database transactions
Thanks for the review @godfat. I've made changes as per your comments. Another round of review please!
@andrewn Thank you! Looks good to me.