Slow the ramp-up of how long webhooks are temporarily disabled, allowing quicker self-healing
About
Auto-disabling of webhooks (enabled for SaaS, and disabled by default for self-managed) is an important part of how GitLab limits resource consumption. Webhooks are triggered millions of times per day on GitLab.com, so well-behaved webhooks have a big impact on system reliability, which benefits all tenants of GitLab.com.
Three settings determine how long a webhook is temporarily disabled:
Threshold | Proposed value |
---|---|
Initial backoff | 1 minute |
Back off growth factor | 2.0 |
Max backoff | 24 hours |
The values are used like this (see code here):
Initial backoff x (Back off growth factor ^ count of previous failed attempts), capped at Max backoff
With these values, a temporarily disabled webhook stays disabled for the following periods after each failed attempt (note: this reflects the behavior after the bugfix !153637 (merged)):
Attempt | Disabled for |
---|---|
1 | 1m |
2 | 2m |
... | ... |
7 | 1h 4m |
8 | 2h 8m |
9 | 4h 16m |
10 | 8h 32m |
11 | 17h 4m |
12 | 24h |
13 | 24h |
... | ... |
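The schedule above can be reproduced with a short sketch. This is an illustration in Python, not the GitLab implementation; the function and constant names are made up for this example, and it assumes the exponent is the number of previous failures (attempt minus one) and that the result is capped at the max backoff.

```python
# Values from the settings table above, expressed in minutes.
INITIAL_BACKOFF_MIN = 1.0
GROWTH_FACTOR = 2.0
MAX_BACKOFF_MIN = 24 * 60.0


def disabled_minutes(attempt: int) -> float:
    """Minutes a webhook stays disabled after its Nth failed attempt (1-based).

    Assumption: the exponent is the count of previous failures (attempt - 1).
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    previous_failures = attempt - 1
    uncapped = INITIAL_BACKOFF_MIN * GROWTH_FACTOR ** previous_failures
    return min(MAX_BACKOFF_MIN, uncapped)


def fmt(minutes: float) -> str:
    """Render minutes as e.g. '1h 4m' for readability."""
    hours, mins = divmod(int(minutes), 60)
    return f"{hours}h {mins}m" if hours else f"{mins}m"


if __name__ == "__main__":
    for attempt in range(1, 14):
        print(attempt, fmt(disabled_minutes(attempt)))
```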
Problem
The periods increase too steeply. Because the period doubles with every failed attempt, a webhook is already disabled for over 17 hours after its 11th failure, and for the full 24 hours from its 12th.
The current approach makes it hard for a webhook to self-heal promptly once its receiver recovers from an error. If the receiver is fixed after only 24 hours of failing, it might be another 24 hours before we allow the webhook to attempt again. We should only disable a webhook for 24 hours when it has been failing for longer than that.
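To put rough numbers on this, the sketch below compares how long a webhook has been failing with how long it is about to be disabled. It assumes, for illustration only, that every retry fails as soon as the previous disabled period ends, so the time spent failing is simply the sum of the earlier periods. The names are hypothetical.

```python
# Same settings as the table in the About section, in minutes.
INITIAL_BACKOFF_MIN = 1.0
GROWTH_FACTOR = 2.0
MAX_BACKOFF_MIN = 24 * 60.0


def disabled_minutes(attempt: int) -> float:
    """Disabled period after the Nth failed attempt (1-based), capped at the max."""
    return min(MAX_BACKOFF_MIN, INITIAL_BACKOFF_MIN * GROWTH_FACTOR ** (attempt - 1))


def minutes_failing_before(attempt: int) -> float:
    """Total minutes already spent disabled before the Nth failed attempt.

    Assumption: each retry fails immediately when the previous period ends.
    """
    return sum(disabled_minutes(n) for n in range(1, attempt))


if __name__ == "__main__":
    for attempt in (8, 11, 12):
        elapsed = minutes_failing_before(attempt)
        next_wait = disabled_minutes(attempt)
        print(f"attempt {attempt}: failing for {elapsed:.0f}m, next wait {next_wait:.0f}m")
```

Under these assumptions, while the period is still doubling the next wait is always slightly longer than all the time the webhook has been failing so far, so a receiver that recovers just after a failed attempt waits about as long again before the webhook is retried.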
Proposal
We should "ease" into disabling webhooks for longer periods. Initially, we could disable for similar periods of time as we currently do, but as the number of attempts increases we would slow the rate at which we approach the "max backoff". This should strike a better balance between allowing webhooks to self-heal more quickly when the webhook receiver is fixed in a timely manner, and still disabling longer-term offenders for the longer periods.
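One possible shape for such a curve, purely as a discussion aid: keep doubling for the first few failures, then switch to a smaller growth factor. The cut-over point (`FAST_STEPS`) and the slower factor (`SLOW_FACTOR`) below are hypothetical values picked for illustration, not proposed settings, and other easing functions would work too.

```python
INITIAL_BACKOFF_MIN = 1.0
MAX_BACKOFF_MIN = 24 * 60.0
FAST_FACTOR = 2.0   # current growth factor, used for the early attempts
SLOW_FACTOR = 1.5   # hypothetical slower factor for later attempts
FAST_STEPS = 5      # hypothetical: number of doublings before easing starts


def eased_minutes(attempt: int) -> float:
    """Disabled period (minutes) after the Nth failed attempt (1-based).

    Doubles for the first FAST_STEPS increases, then grows by SLOW_FACTOR,
    and is always capped at MAX_BACKOFF_MIN.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    previous_failures = attempt - 1
    fast = min(previous_failures, FAST_STEPS)
    slow = max(previous_failures - FAST_STEPS, 0)
    uncapped = INITIAL_BACKOFF_MIN * FAST_FACTOR ** fast * SLOW_FACTOR ** slow
    return min(MAX_BACKOFF_MIN, uncapped)


if __name__ == "__main__":
    for attempt in range(1, 18):
        print(attempt, round(eased_minutes(attempt)))
```

With these illustrative values the first six periods match the current schedule (1m to 32m), and the 24-hour cap is reached at attempt 16 instead of attempt 12.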