
Rework how reverse tunnel idle timeout works

Mikhail Mazurskiy requested to merge ash2k/remove-timeout into master

Relates to #279 (closed).

Relates to https://gitlab.com/gitlab-org/quality/engineering-productivity-infrastructure/-/issues/53.

Each agentk replica tries to keep at least 2 idle reverse connections open to kas. When the number of idle connections drops below this threshold, agentk opens 10 more connections, up to a maximum of 100 open simultaneously. Each connection has an idle timeout of 1 minute, i.e. a connection that has been idle for 1 minute is closed. This logic (added in !625 (merged)) lets agentk scale the number of concurrent reverse tunnels up and down dynamically based on load.

The problem is that there is a race between kas starting to use a connection and agentk cancelling that connection because it timed out. If these events happen simultaneously, kas gets an error when trying to use such a connection. I think that is what we are seeing in those bug reports.
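The two possible orderings can be illustrated with a deliberately simplified, single-process model. The `tunnel` type and its `expire` and `use` methods are hypothetical; in reality kas and agentk are separate processes, so neither side can observe the other's decision atomically, which is why the race exists at all.

```go
package main

import (
	"errors"
	"fmt"
)

// tunnelState is a hypothetical model of one reverse tunnel's lifecycle.
type tunnelState int

const (
	stateIdle tunnelState = iota
	stateInUse
	stateClosed
)

var errTunnelClosed = errors.New("tunnel already closed by idle timeout")

type tunnel struct {
	state tunnelState
}

// expire models agentk's idle timer firing: only an idle tunnel is closed.
func (t *tunnel) expire() {
	if t.state == stateIdle {
		t.state = stateClosed
	}
}

// use models kas picking the tunnel to serve a request.
func (t *tunnel) use() error {
	if t.state == stateClosed {
		return errTunnelClosed
	}
	t.state = stateInUse
	return nil
}

func main() {
	// Ordering A: kas starts using the tunnel first, so the timeout is a no-op.
	a := &tunnel{}
	errA := a.use()
	a.expire()
	fmt.Println("kas first:", errA)

	// Ordering B: the idle timeout fires first, so kas gets an error.
	b := &tunnel{}
	b.expire()
	errB := b.use()
	fmt.Println("timeout first:", errB)
}
```

Which ordering occurs depends only on timing, which matches the intermittent nature of the errors described above.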

Edited by Mikhail Mazurskiy
