
Fix webhooks being temporarily disabled for too long, too quickly

Luke Duncalfe requested to merge ld-webhooks-fix-max-backoff into master

What does this MR do and why?

Webhooks could previously be temporarily disabled for too long, too quickly.

When a webhook is temporarily disabled, it is disabled for a period of time. If the webhook receiver is still returning a 5xx error code after the webhook has auto-enabled, the webhook is temporarily disabled again, this time for a longer period.

This period increases up until the maximum amount of time (the MAX_BACKOFF), which is 1 day.

We have an optimisation that returns MAX_BACKOFF when we know there have been a certain number of failed attempts.

This optimisation was configured for the previous backoff logic and, under the current logic, caused a webhook to reach the MAX_BACKOFF value too quickly.

Instead of the exponential backoff going:

1m, 2m, 4m, 8m, 16m, 32m, 1h 4m, 2h 8m, 4h 16m [and so on, up to a max of 24h]

It was instead going:

1m, 2m, 4m, 8m, 16m, 32m, 1h 4m, 2h 8m, 24h

This could inhibit the ability of a webhook to "self-heal", where a webhook should have re-enabled itself earlier after the webhook receiver had recovered from an error.
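
The two sequences above can be reproduced with a minimal Python sketch. It is an illustration under stated assumptions, not the code changed in this MR: the function names, the 1-minute initial backoff and the threshold of 8 are hypothetical, inferred only from the sequences listed above. It shows how a short-circuit threshold set too low makes the backoff jump straight to MAX_BACKOFF, and how a threshold can be derived so that the short-circuit only fires once the doubling would have reached the cap anyway.

```python
from datetime import timedelta

# Assumed values: 1 minute is inferred from the sequences above,
# 1 day is the MAX_BACKOFF stated in the description.
INITIAL_BACKOFF = timedelta(minutes=1)
MAX_BACKOFF = timedelta(days=1)


def backoff(failures: int, max_failures: int) -> timedelta:
    """Return the disable period after `failures` earlier backoffs (0-based).

    `max_failures` is the (hypothetical) short-circuit threshold: at or
    beyond it, MAX_BACKOFF is returned without computing the exponent.
    """
    if failures >= max_failures:
        return MAX_BACKOFF
    return min(INITIAL_BACKOFF * 2 ** failures, MAX_BACKOFF)


def smallest_safe_threshold() -> int:
    """Smallest failure count at which doubling already reaches MAX_BACKOFF."""
    n = 0
    while INITIAL_BACKOFF * 2 ** n < MAX_BACKOFF:
        n += 1
    return n


def schedule(max_failures: int, steps: int) -> list[int]:
    """Backoff periods in whole minutes for the first `steps` failures."""
    return [int(backoff(i, max_failures).total_seconds() // 60) for i in range(steps)]


if __name__ == "__main__":
    # Threshold too low: jumps from 2h 8m (128 minutes) straight to 24h.
    print(schedule(max_failures=8, steps=9))
    # Threshold derived from the cap: doubling continues until it hits 24h.
    print(schedule(max_failures=smallest_safe_threshold(), steps=12))
```

Deriving the threshold from INITIAL_BACKOFF and MAX_BACKOFF, rather than hard-coding it, keeps the short-circuit consistent if either value changes later.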

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Luke Duncalfe
