Improve review app debugging output
The debugging output of the `review-deploy` job needs some improvement. One way to do this could be to change the `display_deployment_debug` function.
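As a rough, hypothetical sketch of what that could look like (the real `display_deployment_debug` lives in the project's CI scripts; the `release=` label selector and the use of `KUBE_NAMESPACE`/`CI_ENVIRONMENT_SLUG` here are assumptions):

```shell
# Sketch only: the variable names and the release label selector are assumptions,
# not taken from the actual review-deploy job.
function display_deployment_debug() {
  local namespace="${KUBE_NAMESPACE}"
  local release="${CI_ENVIRONMENT_SLUG}"

  echo "Pods for release ${release} that are not Running/Completed:"
  kubectl get pods --namespace "${namespace}" --selector "release=${release}" \
    --no-headers | grep -vE 'Running|Completed' || true

  # Describe each non-ready pod so events such as FailedScheduling or Unhealthy
  # show up directly in the job log instead of requiring manual kubectl access.
  for pod in $(kubectl get pods --namespace "${namespace}" --selector "release=${release}" \
                 --no-headers | grep -vE 'Running|Completed' | awk '{print $1}'); do
    kubectl describe pod "${pod}" --namespace "${namespace}"
  done
}
```

Having that describe output in the job log would remove the need for most of the manual steps below.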
Current process
Here's what the Review App failure process currently looks like for me (all command line). It is very tedious and manual, especially when it comes to looking at individual Helm releases:
- Set up local `kubectl` to connect to the GKE environment (see the sketch after this list).
- Look at the recent success/failure ratio with `helm ls -d`.
- If there are lots of failures:
  - Investigate the Instance Group CPU metrics in GCP.
  - Investigate the Instance Group size in GCP.
  - Look at the pipeline schedules to see when `schedule:review-cleanup` ran successfully.
- If there is a low number of failures, dig into some of them (see the sketch after this list). See https://helm.sh/docs/using_helm/#helpful-options-for-install-upgrade-rollback for details about Helm waiting for readiness.
  - Copy a release name (e.g. `review-create-ans-nn0xqi`).
  - Look at the job output to see which Pods were available after the timeout: run `helm get values review-create-ans-nn0xqi | grep url`, click on the job URL, and look for the pod list below `Error: release review-create-ans-nn0xqi failed: timed out waiting for the condition` to see which app wasn't ready or available.
  - Run a describe on the resources which were not available to look at their events, e.g. `kubectl describe pod review-create-ans-nn0xqi-unicorn-7fbd495b7f-f9hpj`. The events can help provide a timeline of why a pod wasn't ready or available. You may see `FailedScheduling` or `Unhealthy`, which can be used with the timeframe to determine when a pod was ready.
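As a minimal sketch of the manual steps above (the cluster, zone, and project names are placeholders, not the real environment values):

```shell
# Placeholders: the actual cluster, zone, and project for the review apps
# environment are not specified in this issue.
gcloud container clusters get-credentials review-apps-cluster \
  --zone us-central1-a --project my-gcp-project

# Recent releases sorted by date, to eyeball the success/failure ratio.
helm ls -d

# For an individual failed release, find the review app URL from the job output...
helm get values review-create-ans-nn0xqi | grep url

# ...and describe a pod that wasn't ready to get its event timeline.
kubectl describe pod review-create-ans-nn0xqi-unicorn-7fbd495b7f-f9hpj
```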
There are lots of ways to do these steps and to improve them. @gl-quality/eng-prod, let's iterate towards a better process.
Proposal
I'd like to be able to iteratively work towards answering the question of "why are the releases failing" and report across all failures. To do this, we need to work towards something like the following.
- Identify which components/batch jobs are not ready when Helm times out, and iteratively solve this by:
  - Step 1: Output this information to the console (in progress).
  - Step 2: Identify relevant details in Stackdriver from Pod events and Tiller logs for Helm release upgrade failures (see the sketch below).
  - Step 3: Build notifications around failures.
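For Step 2, a hedged starting point might be `gcloud logging read` queries along these lines; the project name is a placeholder and the filters are assumptions that depend on how Stackdriver logging is configured for the cluster:

```shell
# Assumptions: my-gcp-project is a placeholder, and the resource types/fields
# depend on the cluster's Stackdriver (Kubernetes Engine Monitoring) setup.

# Pod events such as FailedScheduling or Unhealthy around a release failure.
gcloud logging read \
  'resource.type="k8s_pod" AND (jsonPayload.reason="FailedScheduling" OR jsonPayload.reason="Unhealthy")' \
  --project my-gcp-project --freshness 1h --limit 50

# Tiller logs, to correlate helm upgrade failures within the same timeframe.
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.container_name="tiller"' \
  --project my-gcp-project --freshness 1h --limit 100
```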