GitLab provides the ability to resend webhook requests in the UI. Documentation here
A customer would like the ability to resend any failed requests using the API, so they can resend them programmatically when they have many failures (a hundred or more).
Proposal
Add API endpoints that would allow users to resend failed project and group webhook requests. In the UI we call these webhook events.
We can do this by providing two endpoints each for project hooks and group hooks, which would add general value to our API beyond just resending failed webhook events:
GET /{projects|groups}/:id/hooks/:hook_id/events - this would expose the same data as when you edit a webhook in the UI and select Recent events.
It would allow filtering by the response status (to allow people to see only failures, for example)
It would allow sorting by ascending or descending order
POST /{projects|groups}/:id/hooks/:hook_id/events/:hook_log_id/resend - this would be the same functionality as in the controller when you resend any of the WebHookLog records (resend the webhook event).
This would allow people to resend failed webhook requests in the following way (a usage sketch follows this list):
See their webhook events in the API (parity with the UI).
Filter those events to see the failures.
Use the id of any event (including failures) to resend it (parity with the UI).
Control the rate at which those events are resent, and decide which order to send them in.
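To make the workflow concrete, here is a hedged sketch of a client script against the proposed endpoints. Nothing here exists yet: the URLs and the statuses parameter come from this proposal, while the response shape (an array of events with an id field) and PRIVATE-TOKEN authentication are assumptions based on existing REST API conventions.

```ruby
require 'net/http'
require 'json'
require 'uri'

API   = 'https://gitlab.example.com/api/v4'
TOKEN = ENV.fetch('GITLAB_TOKEN')
project_id = 42 # placeholder IDs
hook_id    = 7

# 1. List the failed events for the hook (proposed endpoint).
uri = URI("#{API}/projects/#{project_id}/hooks/#{hook_id}/events?statuses=failure")
req = Net::HTTP::Get.new(uri, 'PRIVATE-TOKEN' => TOKEN)
events = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  JSON.parse(http.request(req).body)
end

# 2. Resend each failure, pacing the requests so the client controls the rate.
events.each do |event|
  resend = URI("#{API}/projects/#{project_id}/hooks/#{hook_id}/events/#{event['id']}/resend")
  post = Net::HTTP::Post.new(resend, 'PRIVATE-TOKEN' => TOKEN)
  Net::HTTP.start(resend.host, resend.port, use_ssl: true) { |http| http.request(post) }
  sleep 0.5 # throttle: roughly 2 resends per second
end
```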
Implementation
Project hooks endpoints are defined in API::ProjectHooks (lib/api/project_hooks.rb).
Group hooks endpoints are defined in API::GroupHooks (lib/api/group_hooks.rb).
GET /{projects|groups}/:id/hooks/:hook_id/events
This will expose web_hook_logs records for a webhook (WebHook#web_hook_logs).
It must allow filtering by a statuses argument.
statuses would be:
Any 3-digit integer that is an actual status code, like 404, 200, etc.
Or else a string. Either:
"success" (which we would expand to all valid 2xx codes):
"failure", (which we would expand to all valid 4xx-5xx codes)
We could use Rack::Utils::HTTP_STATUS_CODES.keys to help us build sets of valid status codes for argument validation and for expanding strings to status codes.
statuses must allow multiple statuses to be passed as a comma-separated list (like how people filter issues by labels). A sketch of the expansion follows.
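A minimal sketch of that expansion and validation, assuming illustrative constant and method names (this is not existing GitLab code):

```ruby
require 'rack/utils'

VALID_CODES   = Rack::Utils::HTTP_STATUS_CODES.keys
SUCCESS_CODES = VALID_CODES.select { |code| (200..299).cover?(code) }
FAILURE_CODES = VALID_CODES.select { |code| (400..599).cover?(code) }

# Expand a comma-separated `statuses` value into a list of integer codes.
def expand_statuses(raw)
  raw.to_s.split(',').map(&:strip).flat_map do |value|
    case value
    when 'success' then SUCCESS_CODES
    when 'failure' then FAILURE_CODES
    when /\A\d{3}\z/
      code = Integer(value)
      raise ArgumentError, "unknown status code: #{code}" unless VALID_CODES.include?(code)
      [code]
    else
      raise ArgumentError, "invalid status: #{value}"
    end
  end.uniq
end

expand_statuses('404,success') # => [404, 200, 201, 202, ...]
```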
To support this filtering we must first add a new database index to web_hook_logs.response_status. Note, the web_hook_logs table is very large on GitLab.com, so we may need close help/collaboration with Database reviewers.
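For illustration, the migration could look roughly like this, using GitLab's concurrent-index helpers. The migration class, index name, and column choice are placeholders, and the exact approach (for example, partitioned-index helpers if the table is partitioned) would need Database review:

```ruby
class AddIndexToWebHookLogsOnResponseStatus < Gitlab::Database::Migration[2.1]
  disable_ddl_transaction!

  INDEX_NAME = 'index_web_hook_logs_on_web_hook_id_and_response_status'

  def up
    # Composite index so we can filter a single hook's logs by status.
    add_concurrent_index :web_hook_logs, [:web_hook_id, :response_status], name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :web_hook_logs, INDEX_NAME
  end
end
```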
We must allow sorting by an optional sort argument. This would sort by WebHookLog#created_at (there is an existing index in place).
sort would be either "desc" or "asc".
The default would be "desc".
POST /{projects|groups}/:id/hooks/:hook_id/events/:hook_log_id/resend
This will resend the WebHookLog record.
We must create a new service for resending a webhook event called WebHooks::Events::ResendService and refactor the existing controller logic to also use it.
Because the endpoint will execute a webhook in the request, which makes an external request to a server that might be slow to respond, the endpoint must queue a new worker WebHooks::Events::ResendWorker which would call the new service.
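A minimal sketch of the two classes, assuming the names from this proposal. The delivery call mirrors the existing controller logic of invoking hook.execute with the logged request data and trigger (discussed further below); everything else is an assumption:

```ruby
module WebHooks
  module Events
    class ResendService
      def initialize(web_hook_log)
        @web_hook_log = web_hook_log
      end

      # Re-deliver the original payload and trigger for this event.
      def execute
        hook = @web_hook_log.web_hook
        hook.execute(@web_hook_log.request_data, @web_hook_log.trigger)
      end
    end

    class ResendWorker
      include ApplicationWorker # GitLab's Sidekiq worker concern

      def perform(web_hook_log_id)
        log = WebHookLog.find_by_id(web_hook_log_id)
        return unless log

        ResendService.new(log).execute
      end
    end
  end
end
```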
A large SaaS Premium customer is interested in this feature because their webhook endpoint had downtime for 2 days.
Over the course of those 2 days, GitLab attempted to send over a hundred webhook requests and failed. The customer would like to be able to resend all the failed webhook requests programmatically.
@alexkalderimis @.luke If you review the support ticket, this is pretty interesting. There could be a few further improvements required around handling failed or rate-limited webhooks.
Notably, it looks like the recent events page itself may have crashed, and as this issue notes, it's difficult to resend failed events once resolved (if there are many failures).
At the moment it is almost impossible to use the user interface. Most requests end with a 500 response.
It is also quite impractical to handle events one at a time. In our organization we have more than 1,000 projects, and we need all events to reach the endpoint. In case something fails on our end, we need to recover them for data integrity.
It would be nice if there were a way in the UI to resend failed events within a certain time range.
The API could be given similar functionality for automation.
I hope that the suggestion gets some attention, as the current solution is not really satisfactory.
I have the feeling that the 500 errors are gone and the UI is more performant, so there has probably already been progress. However, it would be nice if the events could be grouped by status and resent in bulk.
Really appreciate your feedback on this @u116076! We are paying attention.
Would you be open to meeting to discuss your feedback? You could set up a time using my calendly: https://calendly.com/g-hickman/45min. Or if nothing seems to be lining up, let me know and we can coordinate a time that does. Thanks!
Reasonably straightforward, but a lot of parts. The functionality is currently implemented in HookLogActions#retry, and essentially boils down to an invocation hook.execute(hook_log.request_data, hook_log.trigger), given some hook and some hook_log. We would then want to include an endpoint in lib/api/project_hooks.rb (and for group and system hooks) to retry a given event.
Like the audit issue, this one would also benefit from the creation of a service so that the same logic can be called from both the controller and the REST endpoint (even though it is very simple) - there is synergy here, as we probably want to include things like retries in audit events.
The most challenging thing here is deciding how the endpoint should be called:
Do we want to retry all failed events?
How do we prevent invoking a failed event twice?
Do we also need to add API endpoints to read the list of events (the web-hook-logs), so that they can be iterated and retried by ID?
Do we need to store the retry information on the log record so that we don't accidentally execute it multiple times?
Should we be implementing GraphQL first?
Given this set of design considerations, I would suggest this is implemented as the following steps:
Create a new service for retrying a web-hook log entry
Add this to the existing controllers
Update the web_hook_logs table to store last_retried_at (a new timestamp column; a migration sketch follows this list)
Add endpoints to list and retry log entries for projects, groups and system hooks.
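A hypothetical migration for step 3; the column name comes from the step above, and everything else is a sketch:

```ruby
class AddLastRetriedAtToWebHookLogs < Gitlab::Database::Migration[2.1]
  def change
    add_column :web_hook_logs, :last_retried_at, :datetime_with_timezone
  end
end
```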
@alexkalderimis Great points, these are some of my thoughts:
Do we also need to add API endpoints to read the list of events (the web-hook-logs), so that they can be iterated and retried by ID?
I think a single endpoint that retried all failures would be most useful #372826 (comment 1166621497). Perhaps we could let people scope to an array of web-hook-log IDs if we wanted to give them the ability to cherry-pick which are retried.
Do we need to store the retry information on the log record so that we don't accidentally execute it multiple times?
This is an important consideration! Especially if we allow people to retry all of their failed webhooks, there could be a window of time where we haven't yet retried a particular failure but will in the future, and in the meantime we wouldn't want it to be requeued for another retry.
We could either write state to the web_hook_log record, or perhaps use an exclusive lease, at the service or worker level, to restrict retries per webhook?
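The exclusive lease option could look something like this, using the existing Gitlab::ExclusiveLease class; the key format, timeout, and surrounding service call are illustrative:

```ruby
# Take a short-lived lease per log record so only one resend runs at a time.
lease = Gitlab::ExclusiveLease.new("web_hooks:resend:#{web_hook_log.id}", timeout: 5.minutes.to_i)

if lease.try_obtain
  WebHooks::Events::ResendService.new(web_hook_log).execute
else
  # A resend of this event is already queued or in flight; skip it.
end
```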
Should we be implementing GraphQL first?
@g.hickman What do you think? Webhook support only exists in REST, so it probably makes sense to implement this in REST also.
@g.hickman, @alexkalderimis touched on this in #372826 (comment 1160893071): the UI lets you resend a single webhook request, but customers would really need the ability to resend many at once, probably all failures of a single given webhook in one go. Optionally, people could specify a date-time range so that only failed webhooks within that range are retried.
We could potentially also let people retry all failures of all webhooks for a given project.
The API would respond quickly with something to say we have queued the retries, and we would do it async.
We'd need to be careful that when we implement it, we retry them without causing a massive flood (both to us and to the receiver), so we should rate limit the retries to something like 10 per second or so. This would mean that 10,000 failed hooks would retry over the course of about 17 minutes, which seems fair.
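One simple way to stagger the retries at roughly that rate, assuming the hypothetical ResendWorker above and Sidekiq's perform_in scheduling:

```ruby
RATE_PER_SECOND = 10

# Spread the batch out: the first 10 logs run immediately, the next 10 after
# one second, and so on, so 10,000 logs take roughly 1,000 seconds (~17 min).
failed_log_ids.each_with_index do |log_id, index|
  delay = (index / RATE_PER_SECOND).seconds
  WebHooks::Events::ResendWorker.perform_in(delay, log_id)
end
```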
We probably want to have a sanity check so that we're not retrying many thousands of hooks against a receiver that is still down, or in trouble and unable to receive them. We could add a kind of kill switch: after 100 or so retry failures within the overall batch that we're retrying, we just stop retrying.
I think that whichever approach to enabling retries is taken, it would be useful to design retries around the idea of idempotency keys. There is also a standardization effort around them.
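To make the idea concrete, a resend could reuse a key persisted with the original delivery. This is purely illustrative: the idempotency_key column and Idempotency-Key header are assumptions, not current behaviour (today the X-Gitlab-Webhook-UUID header changes on every request, as noted later in this thread):

```ruby
require 'securerandom'

# Generate a key once per event and reuse it on every delivery and resend,
# so receivers can deduplicate. Column and header are hypothetical.
key = web_hook_log.idempotency_key ||= SecureRandom.uuid
web_hook_log.save!

request_headers['Idempotency-Key'] = key
```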
Customer has several webhooks in place which are getting deactivated after some time. To re-enable them, they need to manually process those failed webhooks.
But this is not feasible for hundreds of projects.
Current solution for this problem: manually re-enabling the failed webhook in the UI.
Hello @g.hickman - can you share an update on the timeline for implementing this functionality in our API to programmatically re-enable failed webhooks? The referenced customer asked whether we already have this in our backlog for one of the next releases.
Hey @manuel.kraft, I think the timeline could shift on this one unfortunately. I'll defer to @m_frankiewicz as she'll be covering Import and Integrations moving forward.
Hello @m_frankiewicz - it would be great to get your view regarding a timeline for this issue.
If you have any questions related to this issue that you would like to discuss with the customer to move the implementation forward, just let me know here or via Slack.
PS: thank you @g.hickman for the heads-up! Appreciate it!
Hello @m_frankiewicz - thank you for getting back. Please let me know if you have any questions regarding customer specifics. Happy to support you and our teams.
@.luke The issue is in planning breakdown, but the description seems complete and there's a weight, so I wonder what there is to add to make it a good experience for @lifez.
@m_frankiewicz @lifez I've expanded the proposal to be a complete one. Magda, could you please look at the proposal and share your thoughts? @lifez It's not a simple feature; take a look at the description and let me know your thoughts too!
The difference between a failing and a failed webhook - could this be clearer? I understand a failing webhook is still retried, for the same event, until it succeeds - is that correct? What happens when more new requests are made to a failing webhook; do they gather in some queue?
What does it mean to re-enable a failing webhook? Does it mean to retry it before the time it's automatically scheduled for retry?
A failed webhook means a disabled webhook, right? Does that mean this webhook won't be retried until it's manually re-enabled by sending a test webhook from the UI or API?
What happens to failed recent events after the webhook is re-enabled? I understand they need to be manually retried in the UI, or, after this issue is implemented, they could be retried with the API?
@.luke Could you help me understand that? If you agree there's room to improve the docs, I'll create an issue.
@m_frankiewicz Great questions. I think an issue would be excellent. Here are some answers:
I don't see a part in the docs that talks about:
GitLab provides the ability to resend webhook requests in the UI.
Technically it's in this section, but it's very obscure!
To repeat the delivery with the same data, select Resend Request.
I think the Recent events section should be taken out of troubleshooting and become its own section, which could be cross-linked from troubleshooting to increase its discoverability. And the feature to resend the request could perhaps be a sub-section. (Edit: it kind of has its own section here, but that just links to troubleshooting too, and is nested under "Develop webhooks", which it shouldn't be.)
The difference between a failing and a failed webhook - could this be clearer? I understand a failing webhook is still retried, for the same event, until it succeeds - is that correct? What happens when more new requests are made to a failing webhook; do they gather in some queue?
That's good feedback. It's not clear what failing and failed mean. I think you're talking about this section, right? In the section above, we describe these as "temporarily disabled" and "permanently disabled" webhooks. I think it would make sense to continue to use those terms throughout the document, rather than failing and failed.
So we'd say:
If a webhook is temporarily disabled, a banner displays at the top of the edit page explaining why the webhook is disabled and when it is automatically re-enabled. In the case of a permanently disabled webhook, an error banner is displayed. To re-enable a temporarily or permanently disabled webhook, send a test request. If the test request succeeds, the webhook is re-enabled.
What does it mean to re-enable a failing webhook? Does it mean to retry it before the time it's automatically scheduled for retry?
Yes that's right.
A failed webhook means a disabled webhook, right? Does that mean this webhook won't be retried until it's manually re-enabled by sending a test webhook from the UI or API?
Yes. If we called it a permanently disabled webhook instead, it might make this clearer. It isn't temporarily disabled, so it needs you to take an action to re-enable it. Temporarily disabled webhooks will re-enable after a period of time, or you can immediately re-enable them. In both cases, manually re-enabling is done by triggering a test request through the UI or, now, the API, and only if that test request succeeds (returns a 2xx response).
What happens to failed recent events after the webhook is re-enabled? I understand they need to be manually retried in the UI, or, after this issue is implemented, they could be retried with the API?
They are logged as failed recent events and can be resent through the UI. The resend will send the exact headers and payload of that failed event again, so it should be identical to the request that failed.
@.luke @m_frankiewicz For the statuses filtering, I think we should limit it to only status codes, to prevent exposing logs that shouldn't be resent if a customer implements a workflow of GET failed webhook events and POST to resend them.
To support this filtering we must first add a new database index to web_hook_logs.response_status. Note, the web_hook_logs table is very large on GitLab.com, so we may need close help/collaboration with Database reviewers.
We already have an index on web_hook_id. I think it's enough for us, since we still want to add pagination to the GET API.
@lifez Thank you for your reply! What do you mean by "unused log" here?
@.luke For example, assume a user changes the webhook URL, and the webhook log receives a 404 status code and a failure category. If we allow filtering by failure, users may misuse the failure filter and unintentionally retrieve 404 webhook logs. To prevent such misuse, I think allowing users to filter logs with a more fine-grained approach, such as by status code, would be sufficient.
@lifez Ah, I see! I think it would be useful to support filtering by a blanket failure (or success) status for the endpoint. If people want to use that data to resend, they could just resend the ones that aren't 404. But the idea is an optional value-add, so we could implement it without it. @m_frankiewicz do you have any thoughts?
I like that distinction! Some failures, like 429 (rate limited), are more retriable than others, like 410 (gone), so it still has the problem Phawin raised, but I think people can handle these before resending.
filtering by status codes that also includes 5**, 4**
Yes to the idea of being able to combine them (although we wouldn't need to support 5** if we support status strings like server_failure), so status=404,server_failure would work.
Because the endpoint will execute a webhook in the request, which makes an external request to a server that might be slow to respond, the endpoint must queue a new worker WebHooks::Events::ResendWorker which would call the new service.
@.luke @m_frankiewicz WDYT? What if we don't queue, and instead let the client wait for the response? That way, they can handle retries based on the HTTP status code they receive.
One more thing, do we need to allow clients to resend successful webhook events?
@lifez If you resend the request, does this produce a different X-Gitlab-Webhook-UUID? If resending reuses the X-Gitlab-Webhook-UUID from before, then yes, it could be used as an idempotency key.
@m_frankiewicz What do you think about adding an idempotency key? To me, resending a failed webhook sends the same data, so it should send the same X-Gitlab-Webhook-UUID.
@lifez @m_frankiewicz @mitar The X-Gitlab-Webhook-UUID header is a different value each time we send a webhook request. We have an issue for adding idempotency keys, #388692 (closed), which would add a new header. I agree that it's a great feature; we don't currently have it, and we already allow retrying of webhooks through the UI, so I feel the lack of an idempotency header shouldn't block our API implementation. But it makes #388692 (closed) even more relevant.
@lifez I want to check in with you on how you're doing on this. Has the above conversation blocked you? I see you have a couple of MRs open. What's your opinion: would it be good to swap and work first on Web Hooks: support idempotency keys (#388692 - closed)? Would you like that?
The analytics instrumentation label is automatically applied when the feature::addition and workflow::ready for development labels are applied. We encourage teams to plan for instrumentation at the time of feature development for new features. This label helps ~group::analytics instrumentation to proactively reach out to teams that require instrumentation support. In case instrumentation does not apply to this feature, please feel free to remove the label.
@lifez On my personal wish list would be Allow all autodisabled webhooks to self-heal (#396577) - it does change our auto-disabling logic, but for the better I believe. @m_frankiewicz what do you think? It would make our auto-disabling friendlier for 4xx error scenarios, but would still be fine on our services:
Any webhook permanently failing with a 4xx error will only trigger once a day, which is negligible in terms of its effect on services.