Disambiguate Error Messages for KAS versus GitLab Agent Connectivity

In a cluster where the agent had been running fine for a month, I started receiving the following errors in CI jobs:

Auto DevOps DAST Deploy:

Cluster Management detect-helm2-releases job:

TCP connectivity to the KAS port from the cluster's network environment was tested and confirmed working (although this is not a simple thing to do without direct cluster access).

Once the agent in the cluster was restored to a working state, both errors resolved.

Both of these errors can easily be taken to mean that the Runner has a problem talking to KAS, especially the second one, which names the server URL.

Over time I think this will lead to a lot of unnecessary root cause troubleshooting.

I think the errors should clearly identify that the connection with KAS is fine and that it is the agent that is not talking.

Even if additional code is needed to support a disambiguated error message, I think the reduction in confusion and support contacts will be worthwhile.

Personally I also tend to be an advocate of a lot more "success" logging of critical steps in the process because it allows support escalations to validate and start to isolate the source of the errors from logs alone. I've done this in developer tooling and the number of tickets I could close by log analysis alone went way up.

Proposal

client-go (the Go Kubernetes client) sends a timeout value as a URL query parameter (timeout=32s in the screenshot). kas can parse that value and, shortly before the request expires (e.g. after 95% of the timeout has elapsed), respond with an error saying that it timed out finding a suitable connected agent (or reporting any other condition that we can detect; not sure what else).

Edited Jun 11, 2022 by Mikhail Mazurskiy