Rate limit webhook execution and backoff
Summary
From https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6299#note_835373941
Several thousand jobs can finish at roughly the same time, causing
webhooks to fire. When all these webhooks fire, we update the status
of the WebHook in the WebHooks::LogExecutionWorker. All of these
workers try to update the same row in the database, causing lock
contention.
The update is skipped when the attributes loaded on the webhook at the start of the job show that it isn't necessary. But because many jobs were running at the same time, they all loaded the attributes before any of them had written, so this check did not prevent them from attempting the update anyway.
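To illustrate why the skip check does not help here, the following is a minimal sketch, not the actual worker code. The Row struct, the recent_failures attribute, and update_needed? are hypothetical stand-ins for the webhook row and the "is an update necessary" check. It only models the ordering problem: every job takes its snapshot before any job writes.

```ruby
# Hypothetical in-memory stand-in for the webhook row (not GitLab's model).
Row = Struct.new(:recent_failures)

# Hypothetical check: decide from a snapshot whether the row needs updating.
def update_needed?(snapshot)
  snapshot.recent_failures != 0
end

row = Row.new(3)

# Every job loads its own snapshot when it starts, before any job has written.
snapshots = Array.new(1000) { row.dup }

attempted = 0
snapshots.each do |snapshot|
  # The check runs against the stale snapshot, not the current row.
  next unless update_needed?(snapshot)

  attempted += 1
  row.recent_failures = 0 # after the first write, every later write is redundant
end

puts attempted # prints 1000: no job skipped the update
```

In the real system each of those attempts is an UPDATE on the same row, which is where the lock contention comes from.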
Impact
The concurrent updates to a single row can create lock contention in the database.
This in turn causes all these jobs to wait, decreasing throughput and increasing the backlog of other jobs.
The job metrics show the total time spent processing jobs, how much of that time was spent in the database, and how both correlate with the dip in started jobs.
Recommendation
At the very least, we should limit the number of concurrent updates to that single row. We should probably also consider rate-limiting a single webhook for everyone: #337228 (closed)
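As a rough sketch of what per-webhook rate limiting could look like, here is a fixed-window counter keyed by hook ID. The HookRateLimiter class, its limit and window parameters, and the injected clock are all assumptions made for illustration; this is not an existing GitLab API. It keeps state in process memory and is not thread-safe, so a real implementation would need a store shared between workers.

```ruby
# Hypothetical fixed-window limiter; names and limits are illustrative only.
class HookRateLimiter
  def initialize(limit:, window:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @limit = limit   # maximum allowed calls per window, per hook
    @window = window # window length in seconds
    @clock = clock   # injectable so the example does not need to sleep
    @counters = Hash.new { |hash, key| hash[key] = { started_at: nil, count: 0 } }
  end

  # Returns true if this call for hook_id fits in the current window.
  def allow?(hook_id)
    now = @clock.call
    counter = @counters[hook_id]

    # Start a new window on first use or once the previous one has elapsed.
    if counter[:started_at].nil? || now - counter[:started_at] >= @window
      counter[:started_at] = now
      counter[:count] = 0
    end

    return false if counter[:count] >= @limit

    counter[:count] += 1
    true
  end
end

now = 0.0
limiter = HookRateLimiter.new(limit: 2, window: 60, clock: -> { now })

results = Array.new(5) { limiter.allow?(42) }
p results # => [true, true, false, false, false]

now = 61.0 # the first window has elapsed
after_window = limiter.allow?(42)
p after_window # => true
```

A worker would call allow? before executing the hook or writing its status, and skip or reschedule the work when it returns false. The same gate in front of the status write would also bound the number of concurrent updates to the row.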