CI "failing" due to helm "timeout"
We're seeing consistent evidence that the helm upgrade ... calls in CI are succeeding, but they often get a failed status because they exceed the timeout (default 300 seconds) that applies to the --wait flag.
Since these deploys actually succeed more than 90% of the time, we need to find a solution to this nagging problem: the pipeline status is entirely useless while this is occurring.
Addressed so far:
- !15 (merged) Add NGINX Ingress type annotation
- !15 (merged) Activate/adjust ready/live probes for Omnibus
Possible items:
- Increase the timeout (start with 600)
- Disable the HTTP Load Balancing cluster add-on feature on the GKE cluster itself, as we're using NGINX internally.
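A rough sketch of what the two items above might look like as commands. The release name, chart path, and cluster name are hypothetical placeholders, and the integer timeout assumes Helm 2 semantics (seconds), which matches the 300 second default mentioned above. The functions only print the commands so they can be reviewed in the job log before anything is run.

```shell
#!/bin/sh
# Sketch only: release, chart path, and cluster name below are placeholders.
# Timeout in seconds (Helm 2 style integer; Helm 3 expects a duration such as 600s).
HELM_TIMEOUT="${HELM_TIMEOUT:-600}"

# Print the deploy command with the raised timeout instead of running it.
helm_upgrade_cmd() {
  release="$1"
  chart="$2"
  echo "helm upgrade --install --wait --timeout ${HELM_TIMEOUT} ${release} ${chart}"
}

# Print the command that disables the HTTP Load Balancing add-on on a GKE cluster.
# A --zone or --region flag may also be needed, depending on gcloud defaults.
disable_http_lb_cmd() {
  cluster="$1"
  echo "gcloud container clusters update ${cluster} --update-addons=HttpLoadBalancing=DISABLED"
}

helm_upgrade_cmd "review-example" "./chart"
disable_http_lb_cmd "example-cluster"
```

Keeping the timeout in a variable means it can be tuned per pipeline without touching the script.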
Biggest culprit:
Error creating load balancer (will retry): Failed to ensure load balancer for service helm-charts-win/review-38-sidekiq-1btygm-nginx: failed to create forwarding rule a8112bbd1cef511e79a5642010a9a001: googleapi: Error 400: Invalid value for field 'resource.IPAddress': 'a.b.c.d'. Specified IP address is in-use and would result in a conflict., invalid
Essentially, it appears that because we're trying to deploy a complete set of charts, we're making GKE/GCP networking mad: more than one load balancer ends up asking for the same static IP.
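One way to check this theory might be to list which Services ask for the same IP. The sketch below assumes the static IP is requested through each Service's spec.loadBalancerIP field, which this issue does not confirm. The helper name find_shared_ips is made up for illustration.

```shell
#!/bin/sh
# Reads "namespace/name<TAB>ip" lines on stdin and prints every IP that is
# requested by more than one Service, followed by the Services requesting it.
find_shared_ips() {
  awk -F '\t' '
    $2 != "" { count[$2]++; owners[$2] = owners[$2] " " $1 }
    END { for (ip in count) if (count[ip] > 1) print ip ":" owners[ip] }
  '
}

# Assumed usage against a live cluster (not run here):
# kubectl get svc --all-namespaces \
#   -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.loadBalancerIP}{"\n"}{end}' \
#   | find_shared_ips
```

Services without a requested IP are skipped, so only genuine collisions show up in the output.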