UX considerations for auto-disabled and rate-limited webhooks

Background

In https://gitlab.com/gitlab-org/gitlab/-/issues/329213 (scenarios 1 and 2) and https://gitlab.com/gitlab-org/gitlab/-/issues/329207 (scenario 3), we started the groundwork for dealing with misbehaving webhooks. Both changes are behind feature flags and are not yet enabled on gitlab.com.

Overview

#	Scenario	Impact	User action required?
1	Webhook fails with "expected" errors (HTTP 4xx)	Webhook is disabled (after 3 failures)	Yes, verify the endpoint and re-enable the webhook
2	Webhook fails with "unexpected" errors (HTTP 500, network errors, etc.)	Webhook is retried with exponential backoff (starting at 10m, up to 24h)	No, the webhook keeps getting retried and recovers once the endpoint stops misbehaving
3	Webhook gets called too frequently	Webhook calls are blocked (for up to 1 minute)	No, the webhook recovers after the rate-limit interval has passed
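To make scenarios 1 and 2 concrete, here is a minimal sketch of the failure handling described in the table. It is not the actual GitLab implementation: `HookState`, `record_result`, and `next_backoff` are hypothetical names, the doubling factor is an assumption (the table only says "exponential"), and treating every non-2xx, non-4xx outcome as "unexpected" is also an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

FAILURE_THRESHOLD = 3                    # from the table: disabled after 3 failures
INITIAL_BACKOFF = timedelta(minutes=10)  # from the table: starting at 10m
MAX_BACKOFF = timedelta(hours=24)        # from the table: up to 24h
BACKOFF_FACTOR = 2                       # assumption: growth factor is not stated


@dataclass
class HookState:
    """Hypothetical per-webhook failure state."""
    recent_failures: int = 0                 # consecutive "expected" (4xx) failures
    backoff_count: int = 0                   # consecutive "unexpected" failures
    disabled: bool = False                   # set once scenario 1 trips
    retry_after: Optional[timedelta] = None  # delay before the next retry


def next_backoff(backoff_count: int) -> timedelta:
    """Delay before the next retry, given the number of earlier backoffs."""
    # Cap the exponent so very large counts cannot overflow timedelta.
    exponent = min(backoff_count, 16)
    return min(INITIAL_BACKOFF * (BACKOFF_FACTOR ** exponent), MAX_BACKOFF)


def record_result(state: HookState, status: Optional[int]) -> HookState:
    """Update the state after a call; `status` is None for network errors."""
    if status is not None and 200 <= status < 300:
        return HookState()  # success clears the failure state
    if status is not None and 400 <= status < 500:
        # Scenario 1: "expected" error, disable after the threshold.
        state.recent_failures += 1
        if state.recent_failures >= FAILURE_THRESHOLD:
            state.disabled = True
    else:
        # Scenario 2: "unexpected" error, schedule a retry with backoff.
        state.retry_after = next_backoff(state.backoff_count)
        state.backoff_count += 1
    return state
```

Under these assumptions the retry delays run 10m, 20m, 40m, and so on, reaching the 24h cap on the ninth consecutive unexpected failure.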

For 1 and 2, we currently store some information about the failures in the DB and also create an entry in web_hook_logs (as we do for every webhook call, successful or failed). These log entries are already exposed in the UI.

For 3, the webhook call gets silently dropped and we log to auth.log, which is only visible to admins.
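As an illustration of why scenario 3 is invisible to users, the sketch below drops a rate-limited call and leaves only a log line behind. It assumes a fixed one-minute window (consistent with "blocked for up to 1 minute", but not confirmed), uses an in-memory dict in place of the real storage, and the names `FixedWindowLimiter`, `deliver`, `CALL_LIMIT`, and the log message format are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple

WINDOW_SECONDS = 60  # from the overview: calls are blocked for up to 1 minute
CALL_LIMIT = 100     # hypothetical threshold; the real limit is not stated here


class FixedWindowLimiter:
    """In-memory stand-in for the real rate-limiting state."""

    def __init__(self, limit: int = CALL_LIMIT, window: int = WINDOW_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        # hook id -> (window index, calls seen in that window)
        self._windows: Dict[str, Tuple[int, int]] = {}

    def allow(self, hook_id: str) -> bool:
        """Count one call and report whether it is within the limit."""
        current = int(self.clock() // self.window)
        index, count = self._windows.get(hook_id, (current, 0))
        if index != current:
            count = 0  # a new window started, so the counter resets
        count += 1
        self._windows[hook_id] = (current, count)
        return count <= self.limit


def deliver(limiter: FixedWindowLimiter, hook_id: str,
            send: Callable[[], None], log: Callable[[str], None]) -> bool:
    """Send the webhook unless it is rate-limited; dropped calls are only logged."""
    if not limiter.allow(hook_id):
        # Hypothetical message format; nothing is recorded where users can see it.
        log(f"webhook rate limit exceeded hook_id={hook_id}")
        return False
    send()
    return True
```

With a fixed window, a blocked webhook recovers as soon as the next window begins, which matches the "up to 1 minute" wording in the overview.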

Possible UX improvements

Allow resetting failed webhooks

This is definitely needed for 1, and might make sense for 2 as well.

Highlight misbehaving webhooks in the UI

We could mark affected webhooks in the listings shown in the project and group settings, as well as the admin area.

For 1 and 2, we can use the information we store in the DB.

For 3, the rate-limiting state is stored in Redis, but it could be queried as well.
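A listing could combine both sources into a single status per webhook. The sketch below shows one possible mapping; the `HookStatus` values, the `hook_status` function, its boolean inputs, and the priority order (disabled first, then backing off, then rate-limited) are all assumptions for illustration, not an existing API.

```python
from enum import Enum


class HookStatus(Enum):
    OK = "ok"
    DISABLED = "disabled"          # scenario 1: needs user action
    BACKING_OFF = "backing off"    # scenario 2: will be retried automatically
    RATE_LIMITED = "rate limited"  # scenario 3: recovers after the interval


def hook_status(disabled: bool, retry_pending: bool, rate_limited: bool) -> HookStatus:
    """Pick the status to highlight for one webhook in a listing.

    `disabled` and `retry_pending` would come from the DB state,
    `rate_limited` from the rate-limiting state in Redis.
    """
    if disabled:
        return HookStatus.DISABLED
    if retry_pending:
        return HookStatus.BACKING_OFF
    if rate_limited:
        return HookStatus.RATE_LIMITED
    return HookStatus.OK
```

Putting the disabled state first reflects that it is the only one requiring user action, so it should not be masked by a transient condition.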

Send notifications

We could consider sending email notifications to the relevant administrators or owners, especially for scenario 1, which requires user action.

Design source

✏ Figma project

Edited Feb 28, 2022 by Libor Vanc