PG statement timeout when getting old events on very busy webhooks
Summary
On very busy webhooks (800+ events per second), loading older event pages can cause a 500 error. Upon investigation, this is due to the PostgreSQL error `canceling statement due to statement timeout`. The culprit query appears to be:
```sql
SELECT "web_hook_logs".* FROM "web_hook_logs" WHERE "web_hook_logs"."web_hook_id" = $1 AND "web_hook_logs"."created_at" BETWEEN $2 AND $3 ORDER BY "web_hook_logs"."created_at" DESC LIMIT $5 OFFSET $4
```
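For intuition (an illustrative sketch, not GitLab code): `OFFSET` forces PostgreSQL to fetch and discard every row before the requested page, so the work grows linearly with page number, while a keyset condition (`created_at < :cursor`) lets an index such as `(web_hook_id, created_at)` (assumed here) seek straight to the next page. A minimal in-memory analogy in Ruby:

```ruby
# Illustrative only: why high OFFSET pages time out while keyset stays flat.
# "logs" stands in for web_hook_logs rows for one webhook, newest first.
logs = (1..10_000).map { |i| { id: i, created_at: i } }.reverse

def offset_page(logs, page, per_page: 20)
  # OFFSET makes the database walk past (page - 1) * per_page rows
  # before it can return anything.
  rows_touched = (page - 1) * per_page + per_page
  [logs[(page - 1) * per_page, per_page] || [], rows_touched]
end

def keyset_page(logs, before_created_at: nil, per_page: 20)
  # Keyset equivalent of:
  #   WHERE created_at < :cursor ORDER BY created_at DESC LIMIT :per_page
  # which an index can satisfy with a direct seek, touching ~per_page rows.
  scope = before_created_at ? logs.select { |l| l[:created_at] < before_created_at } : logs
  scope.first(per_page)
end

_page, touched = offset_page(logs, 400)
# Page 400 already forces 8000 rows to be walked under OFFSET pagination,
# while the keyset query touches only ~20 regardless of depth.
```

This is why the failure rate climbs with the page number: around `page=8000` the `OFFSET` scan alone exceeds the statement timeout.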
Steps to reproduce
- Find or create a webhook that generates 800-1000 events per second (I will add an internal comment with a customer example)
- If you have created a new webhook, you will probably need to wait a few hours before you can see the error.
- Edit the webhook
- Try to view older events by appending `?page=n`, increasing `n` by 100 or 1000 at a time, until you observe the `500` error. In my testing, I start getting the timeout around `page=8000` on GitLab.com. Alternatively, you can use the Get a list of project webhook events API endpoint with similar parameters.
Example Project
Since this is happening in a customer environment, I will add an internal comment with more information about an example project.
What is the current bug behavior?
When loading older pages of webhook events, a PostgreSQL timeout causes a 500 error. This is not consistent (sometimes the query comes back in time), but the higher the page, the more likely it is to occur.
This makes troubleshooting busy webhooks much more difficult, since we don't have good visibility into older events if an issue is not immediately spotted.
What is the expected correct behavior?
We should return results quickly regardless of page number (easier said than done, I know).
Relevant logs and/or screenshots
You can see a sample log search for failures of this nature in the last 5 hours (both API and UI) here: https://log.gprd.gitlab.net/app/r/s/4gNtq
Output of checks
This bug happens on GitLab.com
Proposal
See thread from #504101 (comment 2499234344) down.
We would change the project and group webhook pages to use keyset pagination, filtering by time, and possibly by response status.
Add keyset pagination for the project and group webhook logs UI.
The backend would expose webhook logs in GraphQL, since that API uses keyset pagination (we can't modify the pagination of the existing REST API log endpoints, as that would be a breaking change to that API). The GraphQL field would be exposed on both Project and Group. The field must limit `max_page_size` to 20, matching the REST API limitation; this is an optimisation because webhook log data can be quite large (if we return the response body).
The frontend would be changed to talk to the GraphQL API and display `< Previous` and `Next >` pagination options.
Example of GraphQL design:
New types:
- `ProjectWebhook`
- `GroupWebhook`
- `WebhookEvent`
New fields added to Project and Group types:
- `webhook`
`ProjectHook` and `GroupHook` would implement a GraphQL interface that adds the `event` and `events` fields, both using the `WebhookEvent` type.
```graphql
{
  project(...) {
    webhook(id: <global ID>) { # A ProjectWebhook. Would expose events booleans like REST API ProjectHook entity https://gitlab.com/gitlab-org/gitlab/-/blob/8304e298f5a009f33ff69c16bf16f52de3def85f/lib/api/entities/project_hook.rb#L7-19
      event(id: <global ID>) { } # To fetch a single webhook log (we call a log an "event" in customer-facing terms)
      events { # Limited to max_page_size of 20
        nodes {
          # Properties would match REST API WebHookLog entity properties https://gitlab.com/gitlab-org/gitlab/-/blob/34caef8e6a9c77a4f943de852825e324c9ebbd4b/lib/api/entities/web_hook_log.rb#L7-15
        }
      }
    }
  }
  group(...) {
    webhook(id: <global ID>) { # A GroupWebhook. Would expose events booleans like REST API GroupHook entity https://gitlab.com/gitlab-org/gitlab/-/blob/4e3a92ff0fdb112d722a518d91b8b83179108781/ee/lib/ee/api/entities/group_hook.rb#L8-22
    }
  }
}
```
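The keyset connection above can be sketched as follows. This is hypothetical illustration code, not GitLab's actual `Gitlab::Pagination::Keyset` implementation; the `id` tie-breaker in the ordering is an assumption, added so that rows sharing a `created_at` value still paginate deterministically:

```ruby
require 'base64'
require 'json'

# Hypothetical sketch of keyset pagination over (created_at DESC, id DESC).
# The opaque cursor encodes both ordering columns, as GraphQL connection
# cursors typically do. Not GitLab's actual implementation.
def encode_cursor(log)
  Base64.strict_encode64(JSON.generate(created_at: log[:created_at], id: log[:id]))
end

def decode_cursor(cursor)
  JSON.parse(Base64.strict_decode64(cursor), symbolize_names: true)
end

def next_page(logs, after: nil, limit: 20)
  sorted = logs.sort_by { |l| [-l[:created_at], -l[:id]] }
  if after
    c = decode_cursor(after)
    # Keep rows strictly beyond the cursor in DESC order, i.e.
    # WHERE (created_at, id) < (:cursor_created_at, :cursor_id)
    sorted = sorted.select do |l|
      ([l[:created_at], l[:id]] <=> [c[:created_at], c[:id]]) < 0
    end
  end
  sorted.first(limit)
end

# 50 logs where many share the same created_at (a heavy-traffic webhook).
logs = (1..50).map { |i| { id: i, created_at: i / 10 } }
page1 = next_page(logs, limit: 20)
page2 = next_page(logs, after: encode_cursor(page1.last), limit: 20)
```

Each page query only needs the `limit` rows past the cursor, so response time no longer degrades with page depth the way `OFFSET` does.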
Allow filtering the project and group webhook logs between two datetimes in the past.
- The frontend would allow people to specify both a date and a time for the filter.
- The backend GraphQL fields would accept datetime arguments to filter the logs by.
And optionally:
- Allowing people to filter logs by `status` (for example, all server errors, or specific response codes), in the same way the `status` argument in the REST API endpoints works. From the discussion above, we wouldn't have an index to back this filter up #504101 (comment 2508390793) (although the REST API endpoints do currently allow filtering by `status`). We could have further conversations with database to see whether, based on that, we could also allow this filtering in the UI. It would again help customers a lot when viewing a large number of logs in the UI.