CI "failing" due to helm "timeout"

We're seeing consistent evidence that the helm upgrade ... calls in CI are succeeding, but the jobs are often marked as failed because the deploy exceeds the timeout (default 300 seconds) that applies when the --wait flag is set.

Since these deploys actually succeed more than 90% of the time, we need to find a solution to this nagging problem: the pipeline status is entirely useless while this is occurring.

Addressed so far:

!15 (merged) Add NGINX Ingress type annotation
!15 (merged) Activate/adjust ready/live probes for Omnibus

Possible items:

Increase the timeout (start with 600)
Disable the HTTP Load Balancing cluster add-on feature on the GKE cluster itself, as we're using NGINX internally.
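
For reference, a rough sketch of what the two items above could look like in a CI script. The release name, chart path, cluster name and zone are hypothetical placeholders, and the DRY_RUN wrapper is only an illustration, not something the pipeline has today. The --timeout value is an integer number of seconds in Helm 2 (Helm 3 expects a duration such as 600s). Disabling the add-on is untested against the error below, so treat it as something to try rather than a confirmed fix.

```shell
#!/bin/sh
# Sketch only. Release, chart, cluster and zone arguments are hypothetical.

# Print the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Give --wait 600 seconds instead of the 300-second default.
# HELM_TIMEOUT is an assumed CI variable used to override the value.
deploy() {
  run helm upgrade --wait --timeout "${HELM_TIMEOUT:-600}" "$1" "$2"
}

# Turn off the GKE HTTP Load Balancing add-on for one cluster.
disable_gke_http_lb() {
  run gcloud container clusters update "$1" --zone "$2" \
    --update-addons=HttpLoadBalancing=DISABLED
}
```

With DRY_RUN=1 set, calling deploy or disable_gke_http_lb only prints the command line, which makes it easy to check what a job would run before changing the real pipeline.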

Biggest culprit:

Error creating load balancer (will retry): Failed to ensure load balancer for service helm-charts-win/review-38-sidekiq-1btygm-nginx: failed to create forwarding rule a8112bbd1cef511e79a5642010a9a001: googleapi: Error 400: Invalid value for field 'resource.IPAddress': 'a.b.c.d'. Specified IP address is in-use and would result in a conflict., invalid

Essentially, it appears that because we're trying to deploy the complete set of charts, we're upsetting GKE/GCP networking: the load balancer asks for the same static IP, which is already in use.
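
To confirm the diagnosis above, something like the following could help. It is a sketch under assumptions: the namespace and IP arguments are placeholders (the real address is redacted as a.b.c.d in the error), and the DRY_RUN wrapper is illustrative. It lists warning events in a review namespace and then the forwarding rules that currently hold a given IP.

```shell
#!/bin/sh
# Sketch only. Namespace and IP arguments are hypothetical placeholders.

# Print the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Show warning events (including load balancer errors) for a namespace.
show_lb_events() {
  run kubectl get events --namespace "$1" --field-selector type=Warning
}

# List the GCP forwarding rules that already use the given IP address.
find_ip_users() {
  run gcloud compute forwarding-rules list --filter="IPAddress=$1"
}
```

If find_ip_users returns a rule belonging to another review deployment, that would support the theory that several releases are competing for the same address.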

Edited Nov 27, 2017 by Jason Plum