The webhooks in https://gitlab.com/gitlab-org that we use for the service keep getting auto-disabled. The original issue is about raising visibility when that happens. This issue focuses on making sure the webhooks stay enabled without human intervention.
Figure out whether there are bugs in auto-disabling webhooks, because it doesn't add up that they get auto-disabled so often: gitlab-org/gitlab#396577 (comment 1433229017)
Use an owner token to automate re-enabling the webhooks:
Periodically, or
When we detect that the service is likely going down
We can do both!
Change the product so that enabling group webhooks requires maintainer permission rather than owner permission, so we can use a maintainer token for this
Make it possible to turn off auto-disabling for this case
Figure out whether there are bugs in auto-disabling webhooks, because it doesn't add up that they get auto-disabled so often: gitlab-org/gitlab#396577 (comment 1433229017)
I'm working on adding logging to the auto-disabling behavior (I'm surprised there is no logging at all already)!
I would guess this is the root culprit because this can explain the webhook being disabled "a bit randomly".
Let's wait and see if this ever happens again. If it's stable enough, I think we can stop here.
@rymai Thank you so much for getting to this. I think my lesson here is what you said on Slack:
Example of why logging is useful, and why the EP team can contribute to improving observability in the product in order to improve our own tools.
And:
Yes, I think we shouldn’t be afraid to contribute to the product sometimes, as it can unblock us as well as improve the long-term quality of features.
Change the product so that enabling group webhooks requires maintainer permission rather than owner permission, so we can use a maintainer token for this
That would definitely make sense!
Lin Jen-Shin changed the description
Use an owner token to automate re-enabling the webhooks:
Periodically, or
When we detect that the service is likely going down
We can do both!
In the team meeting, @splattael mentioned that we can hook into the moment Kubernetes decides to restart or move pods around, and trigger a CI job to re-enable the webhooks. Specifically, that's:
When we detect that the service is likely going down
I am not sure if we can hook into that, but we should be able to trigger the CI job when the application is starting, since we can run whatever we want at that point. If Kubernetes restarts or moves the application pods, the application starts again, and that start can trigger the CI job. This should help because that's likely when the webhook gets disabled, based on the last incident: #1417 (comment 1616398921)
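A rough sketch of the "trigger the CI job when the application starts" idea, written in Python only for illustration. It assumes the CI job lives in a project that has a pipeline trigger token and uses the pipeline trigger endpoint (`POST /projects/:id/trigger/pipeline`). The host, project ID, token, and the function names `build_trigger_request` and `trigger_on_start` are all hypothetical.

```python
import urllib.parse
import urllib.request


def build_trigger_request(gitlab_url, project_id, trigger_token, ref="main"):
    """Build (but do not send) a pipeline trigger request.

    All arguments are placeholders; the real project and ref are not
    decided in this thread.
    """
    url = f"{gitlab_url}/api/v4/projects/{project_id}/trigger/pipeline"
    data = urllib.parse.urlencode({"token": trigger_token, "ref": ref}).encode()
    return urllib.request.Request(url, data=data, method="POST")


def trigger_on_start(request, send=urllib.request.urlopen):
    """Fire the trigger once at boot.

    A failed trigger must never block the application from starting,
    so network errors are swallowed and reported as False.
    """
    try:
        send(request)
        return True
    except OSError:  # urllib.error.URLError is a subclass of OSError
        return False
```

The application would call `trigger_on_start(build_trigger_request(...))` once in its boot path. Whether pod restarts are really the moment the webhook gets disabled is still only a guess based on the last incident.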
One item above that idea in the Google Docs agenda, @rymai mentioned that we can create a private project that uses a group token to re-enable the webhooks periodically. So we can do both!
This should be a cheap API call if there's an API, so I think we can do both and that would be the most reliable workaround.
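To make "do both" concrete, here is a minimal sketch of a single re-enable pass that could be called from a scheduled job and from the application-start trigger alike. It is an assumption-heavy illustration: the `alert_status` field and its `"executable"` value are my assumption about what the group hooks API returns, and `list_hooks` and `reenable` are hypothetical callables standing in for API calls we haven't confirmed exist.

```python
def hooks_to_reenable(hooks):
    """Return the hooks that are not currently executable.

    Assumes each hook is a dict with an `alert_status` key; hooks
    without that key are treated as healthy.
    """
    return [h for h in hooks if h.get("alert_status", "executable") != "executable"]


def reenable_pass(list_hooks, reenable):
    """Run one pass: list hooks, re-enable the disabled ones, return their ids.

    `list_hooks` and `reenable` are injected so the same pass can run
    from a schedule or from an application-start trigger.
    """
    fixed = []
    for hook in hooks_to_reenable(list_hooks()):
        reenable(hook["id"])
        fixed.append(hook["id"])
    return fixed
```

Because the pass only acts on hooks that look disabled, running it from both places should stay cheap: most runs would make one listing call and nothing else.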
I think it really shouldn't disable the webhook so fast: (Screenshot from @stanhu)
Quoting myself:
I think this is really the problem. Even with 99.992% uptime, GitLab still treated the service as broken and disabled the webhook. I don't think this is reasonable at all. How many services have 99.992% uptime?
I think another piece of evidence that there are bugs is that the gitlab-com webhooks didn't get disabled nearly as often. It's the same service, so one webhook getting disabled so much more often than the other doesn't feel right.
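For a sense of scale, this small calculation turns an uptime percentage into the downtime it allows. The 30-day and 7-day windows are my assumption, since the screenshot's measurement period isn't stated here, and `downtime_seconds` is a hypothetical helper name.

```python
from decimal import Decimal


def downtime_seconds(uptime_percent, window_days):
    """Seconds of downtime implied by an uptime percentage over a window.

    `uptime_percent` is passed as a string so Decimal keeps it exact.
    """
    window_seconds = Decimal(window_days) * 24 * 60 * 60
    return (Decimal(100) - Decimal(uptime_percent)) / Decimal(100) * window_seconds


# 99.992% uptime over 30 days allows 207.36 seconds, about 3.5 minutes.
print(downtime_seconds("99.992", 30))
# Over 7 days it allows 48.384 seconds.
print(downtime_seconds("99.992", 7))
```

This says nothing about how the auto-disabling decision is actually made; it only shows how little downtime the service had.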
And if that doesn't work, we'll ask for more owners for gitlab-org (Thanks @yanguo1 for this!)
Not sure who we should grant owner permission to, but maybe 2 people in the AMER timezone and at least one more in the EMEA timezone, so we have 2 for both timezones. (I am closer to early EMEA than APAC.)
@splattael Thank you! Would you mind if I pass this to you? I don't think we need to rush this at all, and if we can disable that for gitlab-org, let's do that, because it completely solves the problem for us, and I think that's way better than any of the alternatives. (The best is still finding and fixing the bug, but I don't feel like digging into that.)
gitlab-org/gitlab!151551 (merged) just hit canary, so we can now create a custom role which will allow “us” to manage webhooks for gitlab-org without being group owners
Who wants to set it up? (You need to be group owner to create the custom role)
I think to start with, we can get everyone on EP to have this custom role? @leetickett-gitlab should members from contributor success also receive this custom role?
The only thing I wonder is whether we should use group shares instead of individuals. That way, when people join or leave teams, we won't have additional admin steps for adding or removing permissions.
should members from contributor success also receive this custom role?
I would certainly appreciate it, especially as we would see the banner when the webhook is disabled (actually, we wouldn't, because the permission check for the banner is wrong... I'll spin up an MR for that)
Oops, I didn't submit this in time!
@rymai I think you updated each individual's role, wdyt of switching to use group shares?
@rymai Is it possible to create a group access token with this new custom role so that we can use the access token to constantly try to activate the webhook from triage-ops?
If we can, perhaps that would be the quickest way to mitigate this?
Not sure if we can give this to a group but I'll see. Edited: No, I can't. When I tried to share to a group, there were no custom roles I could pick. That option only shows up when inviting an individual.
I can't share to a group with a custom role, so I updated the roles for each individual based on the members of https://gitlab.com/gl-quality/eng-prod, except for those who are already owners. I didn't mean to downgrade anyone, even though the very reason I was given owner permission no longer applies. (It's convenient, to be honest.)
Additionally, I also set this custom role for @leetickett-gitlab because even if we don't give this role to Contributor Success, I think he should also have this permission, given that he's often on top of this and is the contributor of this great feature! I am sure it'll be useful.