Hubble fails to start due to conflict with Cilium
Why are we doing this work
When Cilium is used via GMAv2, Hubble sometimes fails to start if it was started before Cilium. Restarting the pods restores Hubble to a working state, and this workaround is documented in !64136 (merged). We need to fix the underlying issue so that the workaround is no longer required. We will upgrade the Cilium chart and check whether that solves the problem.
Example of problematic logs:
kubectl logs -n gitlab-managed-apps hubble-relay-5966c56c69-4dmgz -f
level=info msg="Starting server..." options="{hubbleTarget:unix:///var/run/cilium/hubble.sock dialTimeout:5000000000 retryTimeout:30000000000 listenAddress::4245 debug:false observerOptions:[0x1217a10 0x1217ae0]}" subsys=hubble-relay
level=warning msg="Failed to create gRPC client connection to peer gke-gma-test-default-pool-e3f66473-0ppb; next attempt after 10s" address="10.128.0.41:4244" error="connection error: desc = \"transport: error while dialing: dial tcp 10.128.0.41:4244: connect: no route to host\"" subsys=hubble-relay
level=warning msg="Failed to create gRPC client connection to peer gke-gma-test-default-pool-e3f66473-k80g; next attempt after 10s" address="10.128.0.37:4244" error="connection error: desc = \"transport: error while dialing: dial tcp 10.128.0.37:4244: connect: no route to host\"" subsys=hubble-relay
level=warning msg="Failed to create gRPC client connection to peer gke-gma-test-default-pool-e3f66473-ktm7; next attempt after 10s" address="10.128.0.40:4244" error="connection error: desc = \"transport: error while dialing: dial tcp 10.128.0.40:4244: connect: no route to host\"" subsys=hubble-relay
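To check quickly whether a relay pod is in this state, the warnings above can be reduced to a list of unreachable peers. The following is a minimal sketch, not an official Hubble tool: `unreachable_peers` is a hypothetical helper name, and the pattern assumes the log format matches the lines shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hubble or kubectl): reads hubble-relay
# log lines on stdin and prints "<peer> <address>" once for every peer the
# relay failed to connect to. Assumes the log format shown in this issue.
unreachable_peers() {
  sed -n 's/.*Failed to create gRPC client connection to peer \([^;]*\);.* address="\([^"]*\)".*/\1 \2/p' | sort -u
}

# Usage, assuming the pod name from the example above:
#   kubectl logs -n gitlab-managed-apps hubble-relay-5966c56c69-4dmgz | unreachable_peers
```

An empty result after the Cilium chart upgrade would suggest that the relay can reach every peer, which may be useful when testing whether the upgrade resolves the issue.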
Relevant links
Non-functional requirements
- Documentation:
- Feature flag:
- Performance:
- Testing:
Implementation plan
- Update Cilium chart to a more recent version (version TBD)
- Test to see if it resolves the Hubble issue
- Ensure compatibility with gitlab-agent
Edited by Brian Williams