Optimize Redis polling in tunnel finder
During the gitlab-com/gl-infra/production#8127 incident we saw a huge spike in HSCAN commands issued by kas to Redis. HSCAN is used in several places, but the spike was almost certainly generated by the tunnel routing code, since the other places use this command relatively infrequently.
When at least one tunnel from agent_id=X is established, each kas instance that holds a tunnel from this agent puts a record about itself into Redis. Then, when a request to proxy is received, the routing kas looks for other kas instances that have tunnels for that agent_id. It does that by issuing HSCAN on a particular hash key (based on agent_id). The lookup is repeated every 50ms and continues until:
- a tunnel is found.
- a timeout occurs (20 seconds to find a tunnel).
- the client aborts the request.
When there is a tunnel, "polling" is just a single call to Redis. This is the most typical situation.
The polling happens for each incoming request. When multiple requests for agent_id=X land on a single kas, we have an opportunity to combine their polling into a single polling process per agent on that kas, instead of one polling process per incoming request.
The incident was probably caused by a huge volume of incoming requests for a single agent while that agent was not connected. For each of those requests kas kept polling Redis for up to 20 seconds, which at a 50ms interval is up to 400 polling rounds per request, and across many concurrent requests that added up to a significant number of HSCAN commands. By "deduplicating"/merging polling we can largely mitigate this problem.