GitLab provides the ability to resend webhook requests in the UI. Documentation here
A customer would like the ability to resend any failed requests using the API, so they can resend them programmatically when they have many failures (a hundred or more).
Proposal
Add API endpoints that would allow users to resend failed project and group webhook requests. In the UI we call these webhook events.
We can do this by providing two endpoints each for project hooks and group hooks, which would add general value to our API beyond just resending failed webhook events:
GET /{projects|groups}/:id/hooks/:hook_id/events - this would expose the same data as when you edit a webhook in the UI and select Recent events.
It would allow filtering by the response status (to allow people to see only failures, for example)
It would allow sorting by ascending or descending order
POST /{projects|groups}/:id/hooks/:hook_id/events/:hook_log_id/resend - this would be the same functionality as in the controller when you resend any of the WebHookLog records (resend the webhook event).
This would allow people to resend failed webhook requests in the following way (a usage sketch follows this list):
See their webhook events in the API (parity with the UI).
Filter those events to see the failures.
Use the id of any event (including failures) to resend it (parity with the UI).
Control the rate at which those events are resent, and decide which order to send them in.
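To make the workflow concrete, here is a hedged sketch of a client script against the proposed endpoints. Nothing here exists yet: the URLs and the statuses parameter come from this proposal, while the response shape (an array of events with an id field) and PRIVATE-TOKEN authentication are assumptions based on existing REST API conventions.

```ruby
require 'net/http'
require 'json'
require 'uri'

API   = 'https://gitlab.example.com/api/v4'
TOKEN = ENV.fetch('GITLAB_TOKEN')
project_id = 42 # placeholder IDs
hook_id    = 7

# 1. List the failed events for the hook (proposed endpoint).
uri = URI("#{API}/projects/#{project_id}/hooks/#{hook_id}/events?statuses=failure")
req = Net::HTTP::Get.new(uri, 'PRIVATE-TOKEN' => TOKEN)
events = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  JSON.parse(http.request(req).body)
end

# 2. Resend each failure, pacing the requests so the client controls the rate.
events.each do |event|
  resend = URI("#{API}/projects/#{project_id}/hooks/#{hook_id}/events/#{event['id']}/resend")
  post = Net::HTTP::Post.new(resend, 'PRIVATE-TOKEN' => TOKEN)
  Net::HTTP.start(resend.host, resend.port, use_ssl: true) { |http| http.request(post) }
  sleep 0.5 # throttle: roughly 2 resends per second
end
```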
Implementation
Project hooks endpoints are defined in API::ProjectHooks (lib/api/project_hooks.rb).
Group hooks endpoints are defined in API::GroupHooks (lib/api/group_hooks.rb).
GET /{projects|groups}/:id/hooks/:hook_id/events
This will expose web_hook_logs records for a webhook (WebHook#web_hook_logs).
It must allow filtering by a statuses argument.
statuses would be:
Any 3-digit integer that is an actual status code, like 404, 200, etc.
Or else a string. Either:
"success" (which we would expand to all valid 2xx codes):
"failure", (which we would expand to all valid 4xx-5xx codes)
We could use Rack::Utils::HTTP_STATUS_CODES.keys to help us build sets of valid status codes for argument validation and for expanding strings to status codes.
statuses must allow multiple statuses to be passed as a comma-separated list (like how people filter issues by labels). A sketch of the expansion follows.
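A minimal sketch of that expansion and validation, assuming illustrative constant and method names (this is not existing GitLab code):

```ruby
require 'rack/utils'

VALID_CODES   = Rack::Utils::HTTP_STATUS_CODES.keys
SUCCESS_CODES = VALID_CODES.select { |code| (200..299).cover?(code) }
FAILURE_CODES = VALID_CODES.select { |code| (400..599).cover?(code) }

# Expand a comma-separated `statuses` value into a list of integer codes.
def expand_statuses(raw)
  raw.to_s.split(',').map(&:strip).flat_map do |value|
    case value
    when 'success' then SUCCESS_CODES
    when 'failure' then FAILURE_CODES
    when /\A\d{3}\z/
      code = Integer(value)
      raise ArgumentError, "unknown status code: #{code}" unless VALID_CODES.include?(code)
      [code]
    else
      raise ArgumentError, "invalid status: #{value}"
    end
  end.uniq
end

expand_statuses('404,success') # => [404, 200, 201, 202, ...]
```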
To support this filtering we must first add a new database index to web_hook_logs.response_status. Note, the web_hook_logs table is very large on GitLab.com, so we may need close help/collaboration with Database reviewers.
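For illustration, the migration could look roughly like this, using GitLab's concurrent-index helpers. The migration class, index name, and column choice are placeholders, and the exact approach (for example, partitioned-index helpers if the table is partitioned) would need Database review:

```ruby
class AddIndexToWebHookLogsOnResponseStatus < Gitlab::Database::Migration[2.1]
  disable_ddl_transaction!

  INDEX_NAME = 'index_web_hook_logs_on_web_hook_id_and_response_status'

  def up
    # Composite index so we can filter a single hook's logs by status.
    add_concurrent_index :web_hook_logs, [:web_hook_id, :response_status], name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :web_hook_logs, INDEX_NAME
  end
end
```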
We must allow sorting by an optional sort argument. This would sort by WebHookLog#created_at (there is an existing index in place).
sort would be either "desc" or "asc".
The default would be "desc".
POST /{projects|groups}/:id/hooks/:hook_id/events/:hook_log_id/resend
This will resend the WebHookLog record.
We must create a new service for resending a webhook event called WebHooks::Events::ResendService and refactor the existing controller logic to also use it.
Because the endpoint will execute a webhook in the request, which makes an external request to a server that might be slow to respond, the endpoint must queue a new worker WebHooks::Events::ResendWorker which would call the new service.
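A minimal sketch of the two classes, assuming the names from this proposal. The delivery call mirrors the existing controller logic of invoking hook.execute with the logged request data and trigger (discussed further below); everything else is an assumption:

```ruby
module WebHooks
  module Events
    class ResendService
      def initialize(web_hook_log)
        @web_hook_log = web_hook_log
      end

      # Re-deliver the original payload and trigger for this event.
      def execute
        hook = @web_hook_log.web_hook
        hook.execute(@web_hook_log.request_data, @web_hook_log.trigger)
      end
    end

    class ResendWorker
      include ApplicationWorker # GitLab's Sidekiq worker concern

      def perform(web_hook_log_id)
        log = WebHookLog.find_by_id(web_hook_log_id)
        return unless log

        ResendService.new(log).execute
      end
    end
  end
end
```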
A large SaaS Premium customer is interested in this feature because their webhook endpoint had downtime for 2 days.
Over the course of those 2 days, GitLab attempted to send over a hundred webhook requests and failed. The customer would like to be able to resend all the failed webhook requests programmatically.
@alexkalderimis @.luke If you review the support ticket, this is pretty interesting. There could be a few further improvements required around handling failed or rate-limited webhooks.
Notably, it looks like the recent events page itself may have crashed, and as this issue notes, it's difficult to resend failed events once resolved (if there are many failures).
At the moment it is almost impossible to use the user interface. Most requests end with a 500 response.
It is also quite impractical to handle events one at a time. In our organization we have more than 1,000 projects, and we need all events to reach the endpoint. In case something fails on our end, we need to recover them for data integrity.
It would be nice if there were a way in the UI to resend failed events within a certain time range.
The API could be given similar functionality for automation.
I hope that the suggestion gets some attention, as the current solution is not really satisfactory.
I have the feeling that the 500 errors are gone and the UI is more performant, so there has probably already been progress. However, it would be nice if the events could be grouped by status and resent in bulk.
Really appreciate your feedback on this @u116076! We are paying attention.
Would you be open to meeting to discuss your feedback? You could set up a time using my calendly: https://calendly.com/g-hickman/45min. Or if nothing seems to be lining up, let me know and we can coordinate a time that does. Thanks!
Reasonably straightforward, but a lot of parts. The functionality is currently implemented in HookLogActions#retry, and essentially boils down to an invocation hook.execute(hook_log.request_data, hook_log.trigger), given some hook and some hook_log. We would then want to include an endpoint in lib/api/project_hooks.rb (and for group and system hooks) to retry a given event.
Like the audit issue, this one would also benefit from the creation of a service so that the same logic can be called from both the controller and the REST endpoint (even though it is very simple) - there is synergy here, as we probably want to include things like retries in audit events.
The most challenging thing here is deciding how the endpoint should be called:
Do we want to retry all failed events?
How do we prevent invoking a failed event twice?
Do we also need to add API endpoints to read the list of events (the web-hook-logs), so that they can be iterated and retried by ID?
Do we need to store the retry information on the log record so that we don't accidentally execute it multiple times?
Should we be implementing GraphQL first?
Given this set of design considerations, I would suggest this is implemented as the following steps:
Create a new service for retrying a web-hook log entry
Add this to the existing controllers
Update the web_hook_logs table to store last_retried_at (a new timestamp column; a migration sketch follows this list)
Add endpoints to list and retry log entries for projects, groups and system hooks.
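A hypothetical migration for step 3; the column name comes from the step above, and everything else is a sketch:

```ruby
class AddLastRetriedAtToWebHookLogs < Gitlab::Database::Migration[2.1]
  def change
    add_column :web_hook_logs, :last_retried_at, :datetime_with_timezone
  end
end
```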
@alexkalderimis Great points, these are some of my thoughts:
Do we also need to add API endpoints to read the list of events (the web-hook-logs), so that they can be iterated and retried by ID?
I think a single endpoint that retried all failures would be most useful #372826 (comment 1166621497). Perhaps we could let people scope to an array of web-hook-log IDs if we wanted to give them the ability to cherry-pick which are retried.
Do we need to store the retry information on the log record so that we don't accidentally execute it multiple times?
This is an important consideration! Especially if we allow people to retry all of their failed webhooks, there could be a window of time where we haven't yet retried a particular failure but will in the future, and in the meantime we wouldn't want it to be requeued for another retry.
We could either write state to the web_hook_log record, or perhaps use an exclusive lease, at the service or worker level, to restrict retries per webhook?
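The exclusive lease option could look something like this, using the existing Gitlab::ExclusiveLease class; the key format, timeout, and surrounding service call are illustrative:

```ruby
# Take a short-lived lease per log record so only one resend runs at a time.
lease = Gitlab::ExclusiveLease.new("web_hooks:resend:#{web_hook_log.id}", timeout: 5.minutes.to_i)

if lease.try_obtain
  WebHooks::Events::ResendService.new(web_hook_log).execute
else
  # A resend of this event is already queued or in flight; skip it.
end
```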
Should we be implementing GraphQL first?
@g.hickman What do you think? Webhook support only exists in REST, so it probably makes sense to implement this in REST also.
@g.hickman, @alexkalderimis touched on this in #372826 (comment 1160893071): the UI lets you resend a single webhook request, but customers would really need the ability to resend many at once, probably all failures of a single given webhook in one go. Optionally, people could specify a date-time range so that only failed webhooks within that range are retried.
We could potentially also let people retry all failures of all webhooks for a given project.
The API would respond quickly with something to say we have queued the retries, and we would do it async.
We'd need to be careful that when we implement it, we retry them without causing a massive flood (both to us and to the receiver), so we should rate limit the retries to something like 10 per second or so. This would mean that 10,000 failed hooks would retry over the course of about 17 minutes, which seems fair.
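One simple way to stagger the retries at roughly that rate, assuming the hypothetical ResendWorker above and Sidekiq's perform_in scheduling:

```ruby
RATE_PER_SECOND = 10

# Spread the batch out: the first 10 logs run immediately, the next 10 after
# one second, and so on, so 10,000 logs take roughly 1,000 seconds (~17 min).
failed_log_ids.each_with_index do |log_id, index|
  delay = (index / RATE_PER_SECOND).seconds
  WebHooks::Events::ResendWorker.perform_in(delay, log_id)
end
```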
We probably want to have a sanity check so that we're not retrying many thousands of hooks against a receiver that is still down, or in trouble and unable to receive them. We could add a kind of kill switch: after 100 or so retry failures within the overall batch that we're retrying, we just stop retrying.
I think that whichever approach to enabling retries is taken, it would be useful to design retries around the idea of idempotency keys. There is also a standardization effort around them.
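To make the idea concrete, a resend could reuse a key persisted with the original delivery. This is purely illustrative: the idempotency_key column and Idempotency-Key header are assumptions, not current behaviour (today the X-Gitlab-Webhook-UUID header changes on every request, as noted later in this thread):

```ruby
require 'securerandom'

# Generate a key once per event and reuse it on every delivery and resend,
# so receivers can deduplicate. Column and header are hypothetical.
key = web_hook_log.idempotency_key ||= SecureRandom.uuid
web_hook_log.save!

request_headers['Idempotency-Key'] = key
```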
Customer has several webhooks in place which are getting deactivated after some time. To re-enable them, they need to manually process those failed webhooks.
But this is not feasible for hundreds of projects.
Current solution for this problem: manually re-enabling the failed webhook in the UI.
Hello @g.hickman - can you share an update on the timeline for implementing this functionality in our API to programmatically re-enable failed webhooks? The referenced customer asked whether we already have this in our backlog for one of the next releases.
Hey @manuel.kraft, I think the timeline could shift on this one unfortunately. I'll defer to @m_frankiewicz as she'll be covering Import and Integrations moving forward.
Hello @m_frankiewicz - it would be great to get your view regarding a timeline for this issue.
If you have any questions related to this issue that you would like to discuss with the customer to move the implementation forward, just let me know here or via Slack.
PS: thank you @g.hickman for the heads-up! Appreciate it!
Hello @m_frankiewicz - thank you for getting back. Please let me know if you have any questions regarding customer specifics. Happy to support you and our teams.
@.luke The issue is in planning breakdown, but the description seems complete and there's a weight, so I wonder what there is to add to make it a good experience for @lifez.
@m_frankiewicz @lifez I've expanded the proposal to be a complete one. Magda, could you please look at the proposal and share your thoughts? @lifez It's not a simple feature; take a look at the description and let me know your thoughts too!
The difference between a failing and a failed webhook - could this be clearer? I understand a failing webhook is still retried, for the same event, until it succeeds - is that correct? What happens when more new requests are made to a failing webhook; do they gather in some queue?
What does it mean to re-enable a failing webhook? Does it mean to retry it before the time it's automatically scheduled for retry?
A failed webhook means a disabled webhook, right? Does that mean this webhook won't be retried until it's manually re-enabled by sending a test webhook from the UI or API?
What happens to failed recent events after the webhook is re-enabled? I understand they need to be manually retried in the UI, or, after this issue is implemented, they could be retried with the API?
@.luke Could you help me understand that? If you agree there's room to improve the docs, I'll create an issue.
@m_frankiewicz Great questions. I think an issue would be excellent. Here are some answers:
I don't see a part in the docs that talks about:
GitLab provides the ability to resend webhook requests in the UI.
Technically it's in this section, but it's very obscure!
To repeat the delivery with the same data, select Resend Request.
I think the Recent events section should be taken out of troubleshooting and become its own section, which could be cross-linked from troubleshooting to increase its discoverability. And the feature to resend the request could perhaps be a sub-section. (Edit: it kind of has its own section here, but that just links to troubleshooting too, and is nested under "Develop webhooks", which it shouldn't be.)
The difference between a failing and a failed webhook - could this be clearer? I understand a failing webhook is still retried, for the same event, until it succeeds - is that correct? What happens when more new requests are made to a failing webhook; do they gather in some queue?
That's good feedback. It's not clear what failing and failed mean. I think you're talking about this section, right? In the section above, we describe these as "temporarily disabled" and "permanently disabled" webhooks. I think it would make sense to continue to use those terms throughout the document, rather than failing and failed.
So we'd say:
If a webhook is temporarily disabled, a banner displays at the top of the edit page explaining why the webhook is disabled and when it is automatically re-enabled. In the case of a permanently disabled webhook, an error banner is displayed. To re-enable a temporarily or permanently disabled webhook, send a test request. If the test request succeeds, the webhook is re-enabled.
What does it mean to re-enable a failing webhook? Does it mean to retry it before the time it's automatically scheduled for retry?
Yes that's right.
A failed webhook means a disabled webhook, right? Does that mean this webhook won't be retried until it's manually re-enabled by sending a test webhook from the UI or API?
Yes. If we called it a permanently disabled webhook instead, it might make this clearer. It isn't temporarily disabled, so it needs you to take an action to re-enable it. Temporarily disabled webhooks will re-enable after a period of time, or you can immediately re-enable them. In both cases, manually re-enabling is done by triggering a test request through the UI or, now, the API, and only if that test request succeeds (returns a 2xx response).
What happens to failed recent events after the webhook is re-enabled? I understand they need to be manually retried in the UI, or, after this issue is implemented, they could be retried with the API?
They are logged as failed recent events and can be resent through the UI. The resend will send the exact headers and payload of that failed event again, so it should be identical to the request that failed.
@.luke @m_frankiewicz For the statuses filtering, I think we should limit it to only status codes, to prevent exposing logs that shouldn't be resent if a customer implements a workflow of GET failed webhook events and POST to resend them.
To support this filtering we must first add a new database index to web_hook_logs.response_status. Note, the web_hook_logs table is very large on GitLab.com, so we may need close help/collaboration with Database reviewers.
We already have an index on web_hook_id. I think it's enough for us, since we still want to add pagination to the GET API.
@lifez Thank you for your reply! What do you mean by "unused log" here?
@.luke For example, assume a user changes the webhook URL, and the webhook log receives a 404 status code and a failure category. If we allow filtering by failure, users may misuse the failure filter and unintentionally retrieve 404 webhook logs. To prevent such misuse, I think allowing users to filter logs with a more fine-grained approach, such as by status code, would be sufficient.
@lifez Ah, I see! I think it would be useful to support filtering by a blanket failure (or success) status for the endpoint. If people want to use that data to resend, they could just resend the ones that aren't 404. But the idea is an optional value-add, so we could implement it without it. @m_frankiewicz do you have any thoughts?
I like that distinction! Some failures, like 429 (rate limited), are more retriable than others, like 410 (gone), so it still has the problem Phawin raised, but I think people can handle these before resending.
filtering by status codes that also includes 5**, 4**
Yes to the idea of being able to combine them (although we wouldn't need to support 5** if we support status strings like server_failure), so status=404,server_failure would work.
Because the endpoint will execute a webhook in the request, which makes an external request to a server that might be slow to respond, the endpoint must queue a new worker WebHooks::Events::ResendWorker which would call the new service.
@.luke @m_frankiewicz WDYT? What if we don't queue, and instead let the client wait for the response? That way, they can handle retries based on the HTTP status code they receive.
One more thing, do we need to allow clients to resend successful webhook events?
@lifez If you resend the request, does this produce a different X-Gitlab-Webhook-UUID? If resending reuses the X-Gitlab-Webhook-UUID from before, then yes, it could be used as an idempotency key.
@m_frankiewicz What do you think about adding an idempotency key? To me, resending a failed webhook sends the same data, so it should send the same X-Gitlab-Webhook-UUID.
@lifez @m_frankiewicz @mitar The X-Gitlab-Webhook-UUID header is a different value each time we send a webhook request. We have an issue for adding idempotency keys, #388692 (closed), which would add a new header. I agree that it's a great feature; we don't currently have it, and we already allow retrying of webhooks through the UI, so I feel the lack of an idempotency header shouldn't block our API implementation. But it makes #388692 (closed) even more relevant.
@lifez I want to check in with you on how you're doing on this. Has the above conversation blocked you? I see you have a couple of MRs open. What's your opinion: would it be good to swap and work first on Web Hooks: support idempotency keys (#388692 - closed)? Would you like that?
The analytics instrumentation label is automatically applied when the feature::addition and workflow::ready for development labels are applied. We encourage teams to plan for instrumentation at the time of feature development for new features. This label helps ~group::analytics instrumentation to proactively reach out to teams that require instrumentation support. In case instrumentation does not apply to this feature, please feel free to remove the label.
@lifez On my personal wish list would be Allow all autodisabled webhooks to self-heal (#396577) - it does change our auto-disabling logic, but for the better I believe. @m_frankiewicz what do you think? It would make our auto-disabling friendlier for 4xx error scenarios, but would still be fine on our services:
Any webhook permanently failing with a 4xx error will only trigger once a day, which is negligible in terms of its effect on services.