CI deployment error related to GitLab Agent

Summary

Deploying Kubernetes deployments from the CI with Terraform (Helm provider) fails intermittently:

Agent Logs

{"level":"info","time":"2022-06-28T10:04:53.206Z","msg":"Observability endpoint is up","mod_name":"observability","net_network":"tcp","net_address":"[::]:8080"}
{"level":"error","time":"2022-06-29T00:02:58.391Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: EOF\", received prior goaway: code: NO_ERROR"}
{"level":"error","time":"2022-06-29T00:02:58.391Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: EOF\", received prior goaway: code: NO_ERROR"}
{"level":"warn","time":"2022-06-29T00:02:58.393Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: EOF\", received prior goaway: code: NO_ERROR"}
{"level":"warn","time":"2022-06-29T06:09:05.632Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: EOF\", received prior goaway: code: NO_ERROR"}
{"level":"error","time":"2022-06-29T06:09:05.632Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: EOF\", received prior goaway: code: NO_ERROR"}

The timestamps of the log entries don't seem to correlate with the timestamps of the failed deployments.
I didn't find any of the errors / warnings from the log in the troubleshooting docs.
I found #114 (closed) mentioning the same errors / warnings. It's unclear to me how that issue is related and how it was solved.
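
To check the timestamp correlation more systematically, something like the following could help. It is only a rough sketch: it assumes the agent logs are available as one JSON object per line with the `level`, `time` and `msg` fields shown above, and that the time of the failed CI job is known. The function names, the 5-minute window and the example deployment time are made up for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def parse_time(value):
    """Parse a UTC timestamp in the format used by the agent logs above,
    e.g. 2022-06-29T00:02:58.391Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def entries_near(log_lines, deploy_time, window=timedelta(minutes=5),
                 levels=("error", "warn")):
    """Return (time, level, msg) tuples for log entries with one of the
    given levels that fall within `window` of `deploy_time`."""
    hits = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if entry.get("level") not in levels:
            continue
        when = parse_time(entry["time"])
        if abs(when - deploy_time) <= window:
            hits.append((when, entry["level"], entry["msg"]))
    return hits


# Example with shortened copies of the log lines above and a
# hypothetical deployment time (assumption, not taken from the CI job).
sample_lines = [
    '{"level":"info","time":"2022-06-28T10:04:53.206Z","msg":"Observability endpoint is up"}',
    '{"level":"error","time":"2022-06-29T00:02:58.391Z","msg":"Error handling a connection"}',
    '{"level":"warn","time":"2022-06-29T06:09:05.632Z","msg":"GetConfiguration.Recv failed"}',
]
hypothetical_deploy_time = datetime(2022, 6, 29, 0, 5, 0, tzinfo=timezone.utc)
for when, level, msg in entries_near(sample_lines, hypothetical_deploy_time):
    print(when.isoformat(), level, msg)
```

An empty result for every failed job would support the impression that the logged errors and the deployment failures are unrelated.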

Things tried

forced the agent to restart with kubectl delete pod, but the problem persists

Severity

The issue is more of an annoyance: the CI job fails, but the deployments of the new / updated applications actually succeed. We can simply retry the CI job and it will then pass, because Terraform sees no difference between the desired and the actual state of the deployments.
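
The manual retry could in principle be automated inside the job. The following is a minimal sketch, not something we run today: `run_with_retry` is a hypothetical helper, the attempt count and delay are arbitrary, and it assumes that re-running the same command in the same job behaves like retrying the whole CI job. It only hides the symptom and does not address the underlying error.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=2, delay_seconds=10, runner=subprocess.run):
    """Run `cmd` until it exits with code 0, at most `attempts` times.

    Returns the number of the attempt that succeeded and raises
    RuntimeError if every attempt fails. `runner` is injectable so the
    helper can be exercised without actually running anything.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_seconds)  # brief pause before the next try
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")
```

In the CI job this might be called as `run_with_retry(["terraform", "apply", "-auto-approve"])`; on the second attempt Terraform would be expected to find nothing left to change.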

Context information

using gitlab.com, not self-hosted
using shared runners
Cluster Type: AWS EKS
Kubernetes cluster version: v1.22
agent version / image: v15.1.0
- registry.gitlab.com/gitlab-org/cluster-integration/gitlab-agent/agentk:v15.1.0
agent connection status in the GitLab UI is reported as "Connected"

We recently (end of April 2022) switched from the certificate-based Kubernetes integration to the agent-based integration. It had been working fine for two months until yesterday.