CI: GKE cluster deployments failing, tiller restarting repeatedly.
Summary
On 2020-06-05, we're seeing many deployments into the GKE cluster fail with "transport is closing" errors. Upon inspecting the cluster, there is memory pressure, and the tiller-deploy
Pod keeps restarting due to liveness-probe timeouts.
CPU: 38%
MEM: 91%
Current behavior
Most, if not all, deploys are failing.
Expected behavior
Deploys complete successfully.
Relevant logs
52m Normal Killing pod/gke-review-jarv-a-a9ggc2-sidekiq-all-in-1-v1-f965cbdbc-hnjfp Stopping container sidekiq
58m Normal SuccessfulCreate replicaset/gke-review-jarv-a-a9ggc2-sidekiq-all-in-1-v1-f965cbdbc Created pod: gke-review-jarv-a-a9ggc2-sidekiq-all-in-1-v1-f965cbdbc-hnjfp
52m Normal SuccessfulDelete replicaset/gke-review-jarv-a-a9ggc2-sidekiq-all-in-1-v1-f965cbdbc Deleted pod: gke-review-jarv-a-a9ggc2-sidekiq-all-in-1-v1-f965cbdbc-hnjfp
58m Normal SuccessfulRescale horizontalpodautoscaler/gke-review-jarv-a-a9ggc2-sidekiq-all-in-1-v1 New size: 2; reason: cpu resource above target
52m Normal SuccessfulRescale horizontalpodautoscaler/gke-review-jarv-a-a9ggc2-sidekiq-all-in-1-v1 New size: 1; reason: All metrics below target
58m Normal ScalingReplicaSet deployment/gke-review-jarv-a-a9ggc2-sidekiq-all-in-1-v1 Scaled up replica set gke-review-jarv-a-a9ggc2-sidekiq-all-in-1-v1-f965cbdbc to 2
52m Normal ScalingReplicaSet deployment/gke-review-jarv-a-a9ggc2-sidekiq-all-in-1-v1 Scaled down replica set gke-review-jarv-a-a9ggc2-sidekiq-all-in-1-v1-f965cbdbc to 1
24m Normal Killing pod/tiller-deploy-79bc74d4c4-wdw9h Container tiller failed liveness probe, will be restarted
24m Normal Pulled pod/tiller-deploy-79bc74d4c4-wdw9h Container image "gcr.io/kubernetes-helm/tiller:v2.16.1" already present on machine
24m Normal Created pod/tiller-deploy-79bc74d4c4-wdw9h Created container tiller
24m Normal Started pod/tiller-deploy-79bc74d4c4-wdw9h Started container tiller
2m42s Warning Unhealthy pod/tiller-deploy-79bc74d4c4-wdw9h Readiness probe failed: Get http://10.40.0.7:44135/readiness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2m51s Warning Unhealthy pod/tiller-deploy-79bc74d4c4-wdw9h Liveness probe failed: Get http://10.40.0.7:44135/liveness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
$ kubectl get pods | awk '/Evicted/{ print $1 }'
gke-production-a4b9oa-webservice-77fb99949c-2kmwq
gke-production-a4b9oa-webservice-77fb99949c-r8thz
gke-production-a4b9oa-webservice-77fb99949c-xswc8
gke-production-a4b9oa-webservice-f684b4465-s6s8n
gke-review-1969-r-00eil7-sidekiq-all-in-1-v1-5c4dcc59f4-wkvd2
gke-review-1969-r-00eil7-webservice-7b7dd876b9-mn7w4
gke-review-1969-r-00eil7-webservice-7b7dd876b9-ssllm
gke-review-4-0-st-vibzom-webservice-7c4bf6b878-sqbpt
gke-review-add-ng-tuoxyy-webservice-558f6579cb-kpphl
gke-review-add-ng-tuoxyy-webservice-558f6579cb-wc27g
gke-review-fix-gi-1947kn-webservice-c8cc48b4f-kmz74
gke-review-fix-gi-1947kn-webservice-d98c4b5bf-jdk66
gke-review-fix-gi-1947kn-webservice-d98c4b5bf-rj8w7
gke-review-shell-lyhycz-sidekiq-all-in-1-v1-6677d9855d-x6m69
gke-review-shell-lyhycz-webservice-5454ffc5dd-8mq9v
gke-review-shell-lyhycz-webservice-5454ffc5dd-gqfss
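The one-liner above filters on the STATUS column of kubectl get pods. As a self-contained illustration of the same awk filter (sample output with hypothetical pod names, not from this cluster):

```shell
# Hypothetical `kubectl get pods` output, used to demonstrate the filter
sample='NAME   READY  STATUS   RESTARTS  AGE
web-1  1/1    Running  0         5m
web-2  0/1    Evicted  0         5m'

# Print only the names of evicted pods (first field of matching lines)
echo "$sample" | awk '/Evicted/{ print $1 }'
# prints "web-2"
```

If cleanup is wanted, the filtered names can be piped on to `xargs -r kubectl delete pod` to remove the evicted pods in bulk.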
$ kubectl get pods --field-selector=status.phase=Failed -o json | jq '.items[] | { name: .metadata.name , status: .status.message }'
{
"name": "gke-production-a4b9oa-webservice-77fb99949c-2kmwq",
"status": "The node was low on resource: memory. Container webservice was using 1769760Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-production-a4b9oa-webservice-77fb99949c-r8thz",
"status": "The node was low on resource: memory. Container webservice was using 2051100Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-production-a4b9oa-webservice-77fb99949c-xswc8",
"status": "The node was low on resource: memory. Container webservice was using 2279208Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-production-a4b9oa-webservice-f684b4465-s6s8n",
"status": "The node was low on resource: memory. "
},
{
"name": "gke-review-1969-r-00eil7-sidekiq-all-in-1-v1-5c4dcc59f4-wkvd2",
"status": "The node was low on resource: memory. Container sidekiq was using 1237440Ki, which exceeds its request of 650M. "
},
{
"name": "gke-review-1969-r-00eil7-webservice-7b7dd876b9-mn7w4",
"status": "The node was low on resource: memory. Container webservice was using 2294076Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-review-1969-r-00eil7-webservice-7b7dd876b9-ssllm",
"status": "The node was low on resource: memory. Container webservice was using 2370740Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-review-4-0-st-vibzom-webservice-7c4bf6b878-sqbpt",
"status": "The node was low on resource: memory. Container webservice was using 2486892Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-review-add-ng-tuoxyy-webservice-558f6579cb-kpphl",
"status": "The node was low on resource: memory. Container webservice was using 2327872Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-review-add-ng-tuoxyy-webservice-558f6579cb-wc27g",
"status": "The node was low on resource: memory. Container webservice was using 2634028Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-review-fix-gi-1947kn-webservice-c8cc48b4f-kmz74",
"status": "The node was low on resource: memory. Container webservice was using 2509944Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-review-fix-gi-1947kn-webservice-d98c4b5bf-jdk66",
"status": "The node was low on resource: memory. Container webservice was using 2568720Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-review-fix-gi-1947kn-webservice-d98c4b5bf-rj8w7",
"status": "The node was low on resource: memory. Container webservice was using 2600368Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-review-shell-lyhycz-sidekiq-all-in-1-v1-6677d9855d-x6m69",
"status": "The node was low on resource: memory. Container sidekiq was using 1411172Ki, which exceeds its request of 650M. "
},
{
"name": "gke-review-shell-lyhycz-webservice-5454ffc5dd-8mq9v",
"status": "The node was low on resource: memory. Container webservice was using 2767636Ki, which exceeds its request of 1500M. "
},
{
"name": "gke-review-shell-lyhycz-webservice-5454ffc5dd-gqfss",
"status": "The node was low on resource: memory. Container webservice was using 2533308Ki, which exceeds its request of 1500M. "
}
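Note the unit mismatch in these messages: container usage is reported in Ki (KiB, 1024 bytes) while the requests use M (megabytes, 10^6 bytes). A quick conversion for one of the production webservice pods above shows how far actual usage overshoots the request (a sketch; values copied from the eviction message, the conversion itself is not part of the incident output):

```shell
# From: "Container webservice was using 2051100Ki, which exceeds its request of 1500M."
usage_ki=2051100   # KiB = 1024 bytes
request_m=1500     # M   = 10^6 bytes

# Convert KiB usage to megabytes for an apples-to-apples comparison
usage_m=$(( usage_ki * 1024 / 1000000 ))
echo "usage ${usage_m}M vs request ${request_m}M"
# prints "usage 2100M vs request 1500M"
```

So the webservice containers are sitting around 2.0-2.8 GB against a 1500M request, which is consistent with the node-level memory pressure and the evictions above.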