502 when attempting to delete a web hook due to large number of web hook logs
Summary
User kept getting a 502 when attempting to delete a webhook through the admin panel. The webhook had large number of logs and when attempting to delete it the database connection was terminating due to a large number of destroy jobs on the webhook logs.
rb(main):002:0> SystemHook.find(126)
SystemHook Load (0.6ms) SELECT "web_hooks".* FROM "web_hooks" WHERE "web_hooks"."type" IN ('SystemHook') AND "web_hooks"."id" = 126 LIMIT 1 [["id", 126]]
=> #<SystemHook id: 126, url: "http://...", project_id: nil, created_at: "2017-12-21 00:27:24", updated_at: "2018-04-13 03:41:36", type: "SystemHook", service_id: nil, push_events: true, issues_events: false, merge_requests_events: false, tag_push_events: false, group_id: nil, note_events: false, enable_ssl_verification: false, wiki_page_events: false, token: "", pipeline_events: false, confidential_issues_events: false, repository_update_events: false, job_events: false>
irb(main):003:0> SystemHook.find(126).destroy
SystemHook Load (0.7ms) SELECT "web_hooks".* FROM "web_hooks" WHERE "web_hooks"."type" IN ('SystemHook') AND "web_hooks"."id" = 126 LIMIT 1 [["id", 126]]
(0.3ms) BEGIN
WebHookLog Load (275.6ms) SELECT "web_hook_logs".* FROM "web_hook_logs" WHERE "web_hook_logs"."web_hook_id" = 126 [["web_hook_id", 126]]
SQL (1.3ms) DELETE FROM "web_hook_logs" WHERE "web_hook_logs"."id" = 1495988 [["id", 1495988]]
SQL (0.3ms) DELETE FROM "web_hook_logs" WHERE "web_hook_logs"."id" = 1537182 [["id", 1537182]]
SQL (0.3ms) DELETE FROM "web_hook_logs" WHERE "web_hook_logs"."id" = 1537211 [["id", 1537211]]
[snip about a thousand or so of the same requests ]
SQL (0.3ms) DELETE FROM "web_hook_logs" WHERE "web_hook_logs"."id" = 1537215 [["id", 1537215]]
SQL (0.3ms) DELETE FROM "web_hook_logs" WHERE "web_hook_logs"."id" = 1537222 [["id", 1537222]]
FATAL: terminating connection due to idle-in-transaction timeout
ERROR: server conn crashed?
SQL (0.4ms) DELETE FROM "web_hooks" WHERE "web_hooks"."type" IN ('SystemHook') AND "web_hooks"."id" = 126 [["id", 126]]
(0.1ms) ROLLBACK
ActiveRecord::StatementInvalid: PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK
from /opt/gitlab/embedded/lib/ruby/gems/2.3.0/gems/peek-pg-1.3.0/lib/peek/views/pg.rb:17:in `async_exec'
from /opt/gitlab/embedded/lib/ruby/gems/2.3.0/gems/peek-pg-1.3.0/lib/peek/views/pg.rb:17:in `async_exec'
Manual workaround
See #21940 (comment 215143604)
Possible fixes
We currently use a cron job to remove these old webhook logs here https://gitlab.com/gitlab-org/gitlab-ce/blob/master/app/workers/remove_old_web_hook_logs_worker.rb. The default deletion is 2 days, but perhaps it would be better to define a hard limit the amount of logs we store when large number could accumulate during those 2 days.
Update:
app/workers/remove_old_web_hook_logs_worker.rb has been removed and replace by https://gitlab.com/gitlab-org/gitlab/blob/master/app/workers/prune_web_hook_logs_worker.rb -> delete older than 90 days
Further discussion for Long-term (90-day) storage for web_hook_logs
-> #23022 (closed)