Improve review app debugging output
The debugging output of the `review-deploy` job needs some improvement. One way to do this could be to change the `display_deployment_debug` function.
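As a rough, hypothetical sketch of what that could look like (the real `display_deployment_debug` lives in the project's CI scripts; the `release=` label selector and the use of `KUBE_NAMESPACE`/`CI_ENVIRONMENT_SLUG` here are assumptions):

```shell
# Sketch only: the variable names and the release label selector are assumptions,
# not taken from the actual review-deploy job.
function display_deployment_debug() {
  local namespace="${KUBE_NAMESPACE}"
  local release="${CI_ENVIRONMENT_SLUG}"

  echo "Pods for release ${release} that are not Running/Completed:"
  kubectl get pods --namespace "${namespace}" --selector "release=${release}" \
    --no-headers | grep -vE 'Running|Completed' || true

  # Describe each non-ready pod so events such as FailedScheduling or Unhealthy
  # show up directly in the job log instead of requiring manual kubectl access.
  for pod in $(kubectl get pods --namespace "${namespace}" --selector "release=${release}" \
                 --no-headers | grep -vE 'Running|Completed' | awk '{print $1}'); do
    kubectl describe pod "${pod}" --namespace "${namespace}"
  done
}
```

Having that describe output in the job log would remove the need for most of the manual steps below.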
Current process
Here's what the Review App failure process currently looks like for me (all command line). It is very tedious and manual, especially when it comes to looking at individual Helm releases:
- Set up local `kubectl` to connect to the GKE environment (see the sketch after this list).
- Look at the recent success/failure ratio with `helm ls -d`.
- If there are lots of failures:
  - Investigate the Instance Group CPU metrics in GCP.
  - Investigate the Instance Group size in GCP.
  - Look at the pipeline schedules to see when `schedule:review-cleanup` ran successfully.
- If there is a low number of failures, dig into some of them (see the sketch after this list). See https://helm.sh/docs/using_helm/#helpful-options-for-install-upgrade-rollback for details about Helm waiting for readiness.
  - Copy a release name (e.g. `review-create-ans-nn0xqi`).
  - Look at the job output to see which Pods were available after the timeout: run `helm get values review-create-ans-nn0xqi | grep url`, click on the job URL, and look for the pod list below `Error: release review-create-ans-nn0xqi failed: timed out waiting for the condition` to see which app wasn't ready or available.
  - Run a describe on the resources which were not available to look at their events, e.g. `kubectl describe pod review-create-ans-nn0xqi-unicorn-7fbd495b7f-f9hpj`. The events can help provide a timeline of why a pod wasn't ready or available. You may see `FailedScheduling` or `Unhealthy`, which can be used with the timeframe to determine when a pod was ready.
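As a minimal sketch of the manual steps above (the cluster, zone, and project names are placeholders, not the real environment values):

```shell
# Placeholders: the actual cluster, zone, and project for the review apps
# environment are not specified in this issue.
gcloud container clusters get-credentials review-apps-cluster \
  --zone us-central1-a --project my-gcp-project

# Recent releases sorted by date, to eyeball the success/failure ratio.
helm ls -d

# For an individual failed release, find the review app URL from the job output...
helm get values review-create-ans-nn0xqi | grep url

# ...and describe a pod that wasn't ready to get its event timeline.
kubectl describe pod review-create-ans-nn0xqi-unicorn-7fbd495b7f-f9hpj
```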
There are lots of ways to do these steps and to improve them. @gl-quality/eng-prod, let's iterate towards a better process.
Proposal
I'd like to be able to iteratively work towards answering the question of "why are the releases failing" and report across all failures. To do this, we need to work towards something like the following.
- Identify which components/batch jobs are not ready when Helm times out, and iteratively solve this by:
  - Step 1: Output this information to the console (in progress).
  - Step 2: Identify relevant details in Stackdriver from Pod events and Tiller logs for Helm release upgrade failures (see the sketch below).
  - Step 3: Build notifications around failures.
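For Step 2, a hedged starting point might be `gcloud logging read` queries along these lines; the project name is a placeholder and the filters are assumptions that depend on how Stackdriver logging is configured for the cluster:

```shell
# Assumptions: my-gcp-project is a placeholder, and the resource types/fields
# depend on the cluster's Stackdriver (Kubernetes Engine Monitoring) setup.

# Pod events such as FailedScheduling or Unhealthy around a release failure.
gcloud logging read \
  'resource.type="k8s_pod" AND (jsonPayload.reason="FailedScheduling" OR jsonPayload.reason="Unhealthy")' \
  --project my-gcp-project --freshness 1h --limit 50

# Tiller logs, to correlate helm upgrade failures within the same timeframe.
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.container_name="tiller"' \
  --project my-gcp-project --freshness 1h --limit 100
```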