CI deployment error related to GitLab Agent
Summary
Deploying Kubernetes deployments from the CI with Terraform (Helm provider) fails intermittently:
Agent Logs
{"level":"info","time":"2022-06-28T10:04:53.206Z","msg":"Observability endpoint is up","mod_name":"observability","net_network":"tcp","net_address":"[::]:8080"}
{"level":"error","time":"2022-06-29T00:02:58.391Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: EOF\", received prior goaway: code: NO_ERROR"}
{"level":"error","time":"2022-06-29T00:02:58.391Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: EOF\", received prior goaway: code: NO_ERROR"}
{"level":"warn","time":"2022-06-29T00:02:58.393Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: EOF\", received prior goaway: code: NO_ERROR"}
{"level":"warn","time":"2022-06-29T06:09:05.632Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: EOF\", received prior goaway: code: NO_ERROR"}
{"level":"error","time":"2022-06-29T06:09:05.632Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: EOF\", received prior goaway: code: NO_ERROR"}
- It seems that the timestamps of the log entries and the timestamps of the deployment don't seem to correlate with each other.
- I didn't find any of the errors / warnings from the log in the troubleshooting docs.
- I found #114 (closed) mentioning the same errors / warnings. It's unclear to me how that issue is related and how it was solved.
Things tried
- forced the agent to restart with
kubectl delete pod
, but the problem persists
Severity
The issue is more of an annoyance, because the CI job fails, but the actual deployments of the new / updated applications actually succeed. We can simply retry the CI job and it will then also pass. That is because Terraform sees no difference in the desired vs. the actual state of the deployments.
Context information
- using gitlab.com not self-hosted
- using shared runners
- Cluster Type: AWS EKS
- Kubernetes cluster version: v1.22
- agent version / image: v15.1.0
- registry.gitlab.com/gitlab-org/cluster-integration/gitlab-agent/agentk:v15.1.0
- agent connection status in the GitLab UI is reported as "Connected"
We've recently (end of April 2022) switched from the certificate-based Kubernetes integration to agent-based integration. It's been working fine for two months until yesterday.