Updating a merge request is slow due to GIN index updates
Initial Slack Thread: https://gitlab.slack.com/archives/C3NBYFJ6N/p1626699464005900
Initial Issue: #327616 (closed)
Looking at the Kibana logs, we can see a number of slow updates when updating a merge request (https://log.gprd.gitlab.net/goto/97f1e1269b13d3da965fd9a9df4f5009). The interesting thing here is that we see very few locking statements (locking is logged after 5 seconds of waiting), which suggests something other than lock contention is making these slow.
@msmiley did further investigation into this, and it appears to be related to the GIN indexes on the merge_requests table. When these index updates are slow enough, the UPDATE can hold locks on the merge request until it finishes.
If the row being updated has to move to a new block (which is likely, since we do not reserve space in old blocks for new row versions), then most of the table's 36 indexes have to be updated in addition to 2 blocks in the table itself. 2 of those indexes are GIN indexes. That index type tries to improve the median UPDATE duration by deferring most of its maintenance work into a pending list, but an unlucky transaction has to pay down that debt when the backlog reaches a certain threshold. That could potentially be happening here.
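To make that threshold mechanism concrete, here is a sketch of how the pending-list backlog could be inspected from a psql session (assuming the pgstattuple contrib extension is available; the index name matches the one surveyed below):

```sql
-- Assumes the pgstattuple contrib extension (installing requires privileges).
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- How much deferred work has accumulated in the GIN pending list?
-- pending_pages / pending_tuples grow with each fastupdate-style insert.
SELECT * FROM pgstatginindex('index_merge_requests_on_description_trigram'::regclass);

-- The pending list is flushed into the main index structure once it
-- exceeds gin_pending_list_limit (default 4MB); the unlucky statement
-- that crosses the threshold pays the whole cleanup cost.
SHOW gin_pending_list_limit;
```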
I think both lock contention (at the level of rows and index entries) and gin pending-list cleanup are probably playing a role in causing slow UPDATEs on merge_requests. I'm not sure yet what proportion of the slow events are attributable to each cause.
Anecdotally, cleaning the GIN index pending list for the description field on the merge_requests table can cost multiple seconds, and the overhead increases when there are more pending entries to write into the index. In an informal survey of manually running gin_clean_pending_list('index_merge_requests_on_description_trigram'::regclass), the duration varied between 465 ms and 3155 ms.
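For reference, a minimal sketch of how that informal survey can be reproduced in psql; gin_clean_pending_list() is the standard PostgreSQL function, and \timing reports the wallclock cost of each flush:

```sql
-- Enable psql's client-side timing to measure each flush.
\timing on

-- Manually flush the pending list into the main GIN structure.
-- Returns the number of pending-list pages that were cleaned up.
SELECT gin_clean_pending_list('index_merge_requests_on_description_trigram'::regclass);
```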
Part of the variation in those pending-list durations comes from the disk IO needed to fetch the index blocks, but a lot of the wallclock time is spent on-CPU. There are a few ways to dig deeper into this if needed, but even this informal small sample shows that pending-list cleanup overhead is costing hundreds to thousands of milliseconds of stall for random unlucky DML statements on this table.
Switching focus to lock contention...
The pending-list cleanup definitely causes the affected statement (and its transaction) to run for longer than usual. That implicitly means the transaction holds its locks for longer than usual too. Just wanted to note the possibility of interplay between these two issues -- pending-list cleanup can lead to lock contention by extending transaction duration. This effect could be trivial or significant, not sure at this point. Many other things can also cause transactions to be held open for multiple seconds, so this connection is tenuous.
I picked one of the lock contention events at random and pulled all log events involving the blocking transaction. I am not sure how common this particular case is, so for now treat it as a single data point; we can repeat this by picking a few more blocking transaction ids from the log events emitted by the waiters. In this case, the blocker was an action_cable API call to POST /api/:version/projects/:id/statuses/:sha that updated roughly 2317 rows in the merge_requests table. That UPDATE took 8.6 seconds to complete. While it was running, 2 other transactions were blocked waiting for it to finish -- one waiter was identical to the blocker (the same endpoint with the same args), and the other was a sidekiq job (MergeRequestMergeabilityCheckWorker) trying to do a single-row update on one of the rows being updated by the blocker. It is not clear why the blocker was so slow -- we know it was not itself waiting on lock contention (at least not for 5 seconds), so some other explanation is needed. GIN index maintenance is a possible explanation. https://log.gprd.gitlab.net/goto/34b303948a2e768ba327711b68563cbb
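As an aside, blocker/waiter pairs like this can also be surfaced live (rather than from logs) with a generic pg_stat_activity query -- a sketch using the built-in pg_blocking_pids() function, not the exact method used above:

```sql
-- For each session that is currently blocked, show its query alongside
-- the query and state of each session blocking it.
SELECT
  waiter.pid    AS waiting_pid,
  waiter.query  AS waiting_query,
  blocker.pid   AS blocking_pid,
  blocker.query AS blocking_query,
  blocker.state AS blocking_state   -- 'idle in transaction' blockers show up here too
FROM pg_stat_activity AS waiter
JOIN LATERAL unnest(pg_blocking_pids(waiter.pid)) AS blocked_by(pid) ON true
JOIN pg_stat_activity AS blocker ON blocker.pid = blocked_by.pid;
```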
Picking another blocking transaction id again shows the same action_cable endpoint stalling for a while and then blocking an identical call to the same endpoint. https://log.gprd.gitlab.net/goto/1d77c9def870f35b1ff324e70a5ff7c7
Processors that wait:
```
  4 endpoint_id:MergeRequestMergeabilityCheckWorker
 10 endpoint_id:MergeRequests::CreatePipelineWorker
 12 endpoint_id:MergeWorker
  2 endpoint_id:NewMergeRequestWorker
 34 endpoint_id:POST /api/:version/projects/:id/statuses/:sha
  2 endpoint_id:Projects::MergeRequests::ContentController#widget
  2 endpoint_id:Projects::MergeRequestsController#merge
  6 endpoint_id:Projects::MergeRequestsController#show
  4 endpoint_id:Projects::MergeRequestsController#update
  2 endpoint_id:PUT /api/:version/projects/:id/merge_requests/:merge_request_iid
  2 endpoint_id:PUT /api/:version/projects/:id/merge_requests/:noteable_id/notes/:note_id
  2 endpoint_id:RebaseWorker
 10 endpoint_id:UpdateHeadPipelineForMergeRequestWorker
  7 endpoint_id:UpdateMergeRequestsWorker
```
Processors that block:
```
  3 endpoint_id:MergeRequests::CreatePipelineWorker
 18 endpoint_id:POST /api/:version/projects/:id/statuses/:sha
  1 endpoint_id:Projects::MergeRequests::ContentController#widget
  5 endpoint_id:Projects::MergeRequestsController#merge
  4 endpoint_id:Projects::MergeRequestsController#show
  1 endpoint_id:Projects::MergeRequestsController#update
  1 endpoint_id:Projects::NotesController#create
  1 endpoint_id:PUT /api/:version/projects/:id/merge_requests/:merge_request_iid
  1 endpoint_id:RebaseWorker
  5 endpoint_id:UpdateHeadPipelineForMergeRequestWorker
  5 endpoint_id:UpdateMergeRequestsWorker
```
In that survey of 51 blocking transactions, I found the blockers were:
- 45 UPDATEs
- 1 INSERT
- 5 idle in transaction (these were killed after reaching the idle-in-transaction timeout; this is the case you asked about earlier, where the blocker holds its db transaction open while the ruby code spends time doing something else)
This continues to smell to me like the long tail of flushing the gin index pending-list. I walked through this in more detail over here, in the context of the notes table's gin index: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4725#note_596146675
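For completeness, the standard knobs that bound this long tail are the per-index fastupdate and gin_pending_list_limit storage parameters. A sketch only -- these are not changes we have applied or are proposing here:

```sql
-- Option 1: disable the pending list for this index, so every DML
-- statement pays a small, predictable maintenance cost instead of a
-- rare multi-second flush. Note: turning fastupdate off does not flush
-- entries already in the pending list; a cleanup or vacuum still does that.
ALTER INDEX index_merge_requests_on_description_trigram SET (fastupdate = off);

-- Option 2: keep fastupdate but cap the pending list per index, trading
-- slightly higher median overhead for a shorter worst case (value in kB,
-- chosen here purely for illustration).
ALTER INDEX index_merge_requests_on_description_trigram SET (gin_pending_list_limit = 512);
```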