Backend: Resolve the problem of GitLab.com API throttling for Slack events API endpoint
See #358676 (comment 909790491) for previous discussion.
About
The Slack link unfurling feature &6389 involves receiving events from Slack each time a link is shared on a Slack workspace that has installed the GitLab Slack app.
The Slack request would normally be considered an unauthenticated API request and as such would be rate-limited on GitLab.com to 500 events per minute (see #358676 (comment 910621468)).
It's very likely that we will receive more events than this per minute #358676 (comment 910658578).
If the Slack requests were to be rate-limited, Slack would eventually disable our application's event subscription if:
95% of delivery attempts [fail] within 60 minutes
As noted in #358676 (comment 910621468), this would mean someone with admin permissions to our GitLab Slack app would need to manually re-enable it. Meanwhile, all features built on the Slack events API would be broken on GitLab.com.
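To make the risk concrete, here is a rough back-of-the-envelope sketch of how the 500 per minute limit relates to Slack's 95% failure threshold. It assumes a steady event rate, that every request above the limit counts as one failed delivery attempt, and that reaching exactly 95% is enough to trip the threshold. It ignores Slack's retries. The function names are hypothetical.

```python
def failed_fraction(events_per_minute: int, limit_per_minute: int = 500) -> float:
    """Fraction of requests rejected by the rate limit at a steady event rate."""
    if events_per_minute <= 0:
        return 0.0
    rejected = max(0, events_per_minute - limit_per_minute)
    return rejected / events_per_minute


def would_trip_slack_threshold(
    events_per_minute: int,
    limit_per_minute: int = 500,
    threshold_percent: int = 95,  # assumption: "at least 95%" trips the threshold
) -> bool:
    """True if the rejected share reaches the threshold (integer arithmetic)."""
    if events_per_minute <= 0:
        return False
    rejected = max(0, events_per_minute - limit_per_minute)
    return rejected * 100 >= threshold_percent * events_per_minute
```

Under these assumptions, 1,000 events per minute would see half of the deliveries fail, and a sustained 10,000 events per minute (9,500 rejected) would reach the 95% mark.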
We should find a solution to raise the rate limit of the Slack events API endpoint that balances:
- The need for Slack to send events to a single endpoint.
- The need to protect GitLab.com.
- The ability to still throttle the endpoint, configurable live by an SRE if needed to address an incident.
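As an illustration of the last point only, the following is a minimal sketch of an endpoint-specific throttle whose limit is read from a settings object on every request, so a change made during an incident takes effect immediately. It is not how GitLab implements rate limiting. The class names, the fixed-window algorithm, and the default values are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ThrottleSettings:
    """Hypothetical live-editable settings for one endpoint."""
    enabled: bool = True
    requests_per_period: int = 500
    period_seconds: int = 60


class EndpointThrottle:
    """Fixed-window counter that re-reads its settings on every call."""

    def __init__(self, settings: ThrottleSettings,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.settings = settings
        self._clock = clock
        self._window_start: Optional[float] = None
        self._count = 0

    def allow(self) -> bool:
        s = self.settings  # read each time so live changes apply at once
        if not s.enabled:
            return True
        now = self._clock()
        # Start a new window on first use or once the period has elapsed.
        if self._window_start is None or now - self._window_start >= s.period_seconds:
            self._window_start = now
            self._count = 0
        if self._count >= s.requests_per_period:
            return False
        self._count += 1
        return True
```

Raising `requests_per_period` for this one endpoint while leaving the general unauthenticated limit untouched is the kind of trade-off the list above describes. An SRE could still lower it or toggle `enabled` without a deploy.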
Some things to note
A key aspect of the endpoint that receives Slack events is that it must remain very performant and do very little work in the request. This is mentioned in #358676 (closed) and #358677, and in the Slack docs:
Respond to events with a HTTP 200 OK as soon as you can. Avoid actually processing and reacting to events within the same process. Implement a queue to handle inbound events after they are received.
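The pattern Slack recommends can be sketched roughly as follows: the request handler only enqueues the payload and returns 200, and a separate worker does the real processing. This is an illustration, not the GitLab implementation. The function names and the in-process queue are assumptions, and a real deployment would use a background job system instead. The payload shape (an outer object with an `event` whose `type` is `link_shared`) follows the Slack Events API.

```python
import queue
from typing import List, Optional

# Hypothetical in-process queue standing in for a background job system.
events: "queue.Queue[Optional[dict]]" = queue.Queue()
processed: List[str] = []


def handle_slack_event(payload: dict) -> int:
    """Request handler: do no real work, just enqueue and acknowledge."""
    events.put(payload)
    return 200  # HTTP 200 OK, returned before any processing happens


def worker() -> None:
    """Consume queued events outside the request; None is a stop signal."""
    while True:
        payload = events.get()
        try:
            if payload is None:
                return
            # Real unfurling work would happen here.
            processed.append(payload["event"]["type"])
        finally:
            events.task_done()
```

Because the handler does nothing but enqueue, its cost stays roughly constant no matter how many event types we subscribe to later.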
Another aspect to note is that link unfurling is currently the only event type we plan to receive from Slack, but it might not be the last. All event types we subscribe to are handled by the same single endpoint. New features that require subscribing to new Slack event types will add to the number of requests that we receive on this endpoint. We need to consider this both in the solution and when implementing features in the future.
Proposal
We should first conduct a production readiness review as per https://about.gitlab.com/handbook/engineering/infrastructure/production/readiness/ (mentioned in #358676 (comment 910655795)).
At the time of writing, the current proposed solution is described in #358676 (comment 911151642), but any solution should be settled on only after the production readiness review.