Fix webhooks being temporarily disabled for too long, too quickly
What does this MR do and why?
Webhooks could previously be temporarily disabled for too long, too quickly.
When a webhook is temporarily disabled, it stays disabled for a period of time and then re-enables itself automatically. If the webhook receiver is still returning a 5xx error code after that, the webhook is temporarily disabled again, this time for a longer period.
This period increases up to a maximum amount of time (the MAX_BACKOFF), which is 1 day.
We have an optimisation that returns MAX_BACKOFF when we know there have been a certain number of failed attempts.
This optimisation was configured according to previous logic, and with the current logic it caused a webhook to reach the MAX_BACKOFF value too quickly.
Instead of the exponential backoff going:
1m, 2m, 4m, 8m, 16m, 32m, 1h 4m, 2h 8m, 4h 16m [and so on, up to a max of 24h]
It was instead going:
1m, 2m, 4m, 8m, 16m, 32m, 1h 4m, 2h 8m, 24h
This could inhibit the ability of a webhook to "self-heal": the webhook could stay disabled for 24 hours even though the webhook receiver had recovered from its error much earlier, when the webhook should already have re-enabled itself.
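The two sequences above can be reproduced with a small sketch. This is an illustration only, not the code changed in this MR: INITIAL_BACKOFF, the method names, and the shortcut_at parameter are hypothetical, and the values 8 and 11 are inferred from the sequences above rather than taken from the source.

```ruby
# Hypothetical sketch; names and thresholds are illustrative assumptions,
# not the actual implementation touched by this MR.
INITIAL_BACKOFF = 60            # 1 minute, in seconds (assumed from the "1m" first step)
MAX_BACKOFF = 24 * 60 * 60      # 1 day, in seconds

# Backoff for the given step (step 0 is the first temporary disable).
# `shortcut_at` is the step from which MAX_BACKOFF is returned directly,
# without computing the exponential value.
def backoff_seconds(step, shortcut_at:)
  return MAX_BACKOFF if step >= shortcut_at

  [INITIAL_BACKOFF * (2**step), MAX_BACKOFF].min
end

# Smallest step whose exponential value already reaches MAX_BACKOFF.
# Taking the shortcut at or after this step never changes the result.
def safe_shortcut_step
  step = 0
  step += 1 while INITIAL_BACKOFF * (2**step) < MAX_BACKOFF
  step
end

# Backoff values, in minutes, for the first `step_count` steps.
def backoff_minutes(step_count, shortcut_at:)
  (0...step_count).map { |step| backoff_seconds(step, shortcut_at: shortcut_at) / 60 }
end

p backoff_minutes(9, shortcut_at: 8)
p backoff_minutes(12, shortcut_at: safe_shortcut_step)
```

Under these assumptions, a shortcut at step 8 jumps from 128 minutes (2h 8m) straight to 1440 minutes (24h), while a shortcut at step 11 lets the sequence continue through 256, 512, and 1024 minutes before it is capped at 24h.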
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.