After digging into the Deployment's pod, I found the following interesting error messages in the "Events" tab:
```
Unable to mount volumes for pod "dns-gitlab-review-app-external-dns-5d997ff5c6-qb7zb_review-apps-ce(06add1c3-87b4-11e9-80a9-42010a800107)": timeout expired waiting for volumes to attach or mount for pod "review-apps-ce"/"dns-gitlab-review-app-external-dns-5d997ff5c6-qb7zb". list of unmounted volumes=[aws-credentials dns-gitlab-review-app-external-dns-token-sj5jm]. list of unattached volumes=[aws-credentials dns-gitlab-review-app-external-dns-token-sj5jm]
(FailedMount, Jun 5, 2019, 7:06:51 PM to Jun 6, 2019, 10:51:48 AM, 418 times)

MountVolume.SetUp failed for volume "dns-gitlab-review-app-external-dns-token-sj5jm" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/06add1c3-87b4-11e9-80a9-42010a800107/volumes/kubernetes.io~secret/dns-gitlab-review-app-external-dns-token-sj5jm --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/06add1c3-87b4-11e9-80a9-42010a800107/volumes/kubernetes.io~secret/dns-gitlab-review-app-external-dns-token-sj5jm
Output: Failed to start transient scope unit: Connection timed out
(FailedMount, Jun 5, 2019, 7:05:52 PM to Jun 6, 2019, 10:37:59 AM, 382 times)
```
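For anyone retracing this from the CLI rather than the UI, the same events can be pulled with kubectl; a minimal sketch, using the pod name and namespace from the events above:

```shell
# Inspect the failing pod's events directly (same data as the "Events" tab)
kubectl -n review-apps-ce describe pod dns-gitlab-review-app-external-dns-5d997ff5c6-qb7zb

# Or list all FailedMount events in the namespace
kubectl -n review-apps-ce get events --field-selector reason=FailedMount
```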
I've now deleted the Deployment and retried a deploy, which recreated the dns-gitlab-review-app-external-dns Deployment automatically, but we still have the same mounting problem.
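For the record, the deletion step was along these lines (a sketch; the namespace comes from the events above):

```shell
# Delete the stuck Deployment; the next review-app deploy job recreates it
kubectl -n review-apps-ce delete deployment dns-gitlab-review-app-external-dns
```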
This error seems to be a node-level issue. I'm attempting to drain the node, but I'm seeing "Cannot evict pod as it would violate the pod's disruption budget." from many of the pods.
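The drain attempt and the disruption-budget check looked roughly like this (a sketch; NODE_NAME is a placeholder):

```shell
# Drain the suspect node; evictions respect PodDisruptionBudgets,
# which is where the "Cannot evict pod" errors come from
kubectl drain NODE_NAME --ignore-daemonsets --delete-local-data

# See which PodDisruptionBudgets are blocking eviction (ALLOWED DISRUPTIONS = 0)
kubectl get pdb --all-namespaces
```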
Looking into one of these pods, I'm seeing "mount -t tmpfs tmpfs /var/lib/kubelet/pods/6e682ca5-87db-11e9-80a9-42010a800107/volumes/kubernetes.io~secret/default-token-5vgvd Output: Failed to start transient scope unit: Connection timed out" from a registry pod I'm attempting to move. This is occurring on another node (db983b3e-pq36), not just the original one (db983b3e-5sj9).
Node db983b3e-bkkk seems to be able to take the pod I've been attempting to move. I can't ssh to db983b3e-pq36 as it seems GCP can't deploy the SSH keys to it successfully.
I forcibly drained the node (e.g. kubectl delete pods --field-selector=spec.nodeName=NODE_NAME) and restarted it over SSH. I've kubectl uncordon'd it as well.
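Spelled out, the forced-drain sequence was roughly the following (a sketch; NODE_NAME is a placeholder, and deleting pods directly bypasses PodDisruptionBudgets, so use with care):

```shell
# Mark the node unschedulable so nothing new lands on it
kubectl cordon NODE_NAME

# Force-delete every pod scheduled on the node (skips PDB-protected eviction)
kubectl delete pods --all-namespaces --field-selector=spec.nodeName=NODE_NAME

# ...reboot the node out-of-band (SSH / cloud console)...

# Allow scheduling again once the node is back
kubectl uncordon NODE_NAME
```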
@WarheadsSE Wow, your messages are so useful, I'm learning so much from each of them! I cannot thank you enough; I'm going to document all of this! Thanks again!
Short answer: we don't. We don't have that control, and we really don't want to get into the business of machine image maintenance.
Thanks for the summary. Do you think we could avoid this in the future by using machines with fewer resources, so that a single node won't receive too many mounts?
I think we may need to reduce the load on any individual system, especially if we're using many mounts. It's becoming apparent that we're not hitting system resource limits in terms of CPU/memory, but rather subsystem limitations exposed by the automation Kubernetes provides.
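To make "subsystem limitations" concrete: each secret/ConfigMap volume is a tmpfs mount created through a systemd transient scope (hence "Mounting command: systemd-run" in the logs), and the "Failed to start transient scope unit: Connection timed out" output suggests systemd on the node is the bottleneck. A quick way to gauge this on a node (a sketch, assuming SSH access):

```shell
# Count active mounts on the node; each secret/ConfigMap volume adds a tmpfs mount
mount | wc -l

# Count systemd scope units; each Kubernetes transient mount creates one
systemctl list-units --type=scope --no-legend | wc -l
```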
```
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/9069fa14-8856-11e9-80a9-42010a800107/volumes/kubernetes.io~secret/default-token-5vgvd --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/9069fa14-8856-11e9-80a9-42010a800107/volumes/kubernetes.io~secret/default-token-5vgvd
Output: Failed to start transient scope unit: Connection timed out
```
It has taken some time, but the number of containers in a non-Running state has dropped from 60 to 50, Running has climbed to 21, and mounts have climbed to 1026.
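Something like the following gives roughly those numbers (a sketch; the mount count is taken on the node itself, as in the earlier sketch):

```shell
# Pods not yet Running across the cluster (header suppressed for counting)
kubectl get pods --all-namespaces --no-headers --field-selector=status.phase!=Running | wc -l
kubectl get pods --all-namespaces --no-headers --field-selector=status.phase=Running | wc -l

# On the node: total active mounts (the 1026 figure above)
mount | wc -l
```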
@WarheadsSE FYI, in parallel I've been deleting old Review Apps with helm delete; they were probably stuck without DNS due to the dns-gitlab-review-app-external-dns Deployment issue originally reported here.
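The cleanup was a loop along these lines (a sketch; the review- name filter is a hypothetical pattern, and this was Helm 2, hence delete --purge):

```shell
# List releases that look like old review apps (name pattern is hypothetical)
helm ls --all --short | grep '^review-'

# Delete them, purging the release metadata as well (Helm 2 syntax)
for release in $(helm ls --all --short | grep '^review-'); do
  helm delete --purge "$release"
done
```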
```
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/7859e8ea-886f-11e9-83b7-42010af0001f/volumes/kubernetes.io~secret/default-token-k9j8n --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/7859e8ea-886f-11e9-83b7-42010af0001f/volumes/kubernetes.io~secret/default-token-k9j8n
Output: Failed to start transient scope unit: Connection timed out
```
It looks like even new nodes (from the new smaller-machines pool) are immediately hitting...
```
Unable to mount volumes for pod "review-5456-add-c-38s8rz-registry-7cb4797cd5-2mj8j_review-apps-ee(e1357a37-8888-11e9-83b7-42010af0001f)": timeout expired waiting for volumes to attach or mount for pod "review-apps-ee"/"review-5456-add-c-38s8rz-registry-7cb4797cd5-2mj8j". list of unmounted volumes=[registry-secrets default-token-k9j8n]. list of unattached volumes=[registry-server-config registry-secrets etc-ssl-certs default-token-k9j8n]
```
```
Error: a release named dns-gitlab-review-app already exists.
Run: helm ls --all dns-gitlab-review-app; to check the status of the release
Or run: helm del --purge dns-gitlab-review-app; to delete it
```
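Following the error's own suggestion, the recovery is to inspect and then purge the stuck release before redeploying (Helm 2 commands, as quoted in the error message):

```shell
# Check what state the existing release is in (likely FAILED or DELETED)
helm ls --all dns-gitlab-review-app

# Remove it completely, including release history, so a fresh install can proceed
helm del --purge dns-gitlab-review-app
```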
If you choose to manage your own cluster, project-specific resources will not be created automatically. If you are using Auto DevOps, you will need to explicitly provide the KUBE_NAMESPACE deployment variable that will be used by your deployment jobs, otherwise a namespace will be created for you.
I've added KUBE_NAMESPACE as a CI/CD variable for now, but I think the existing namespace specified in the UI (e.g. ) should be respected, so I'll open a bug.
The external-dns chart is now properly installed:
```
** Installing external DNS for domain gitlab-review.app... **
** Checking if dns-gitlab-review-app exists in the review-apps-ee namespace... **
Deployment status for dns-gitlab-review-app is 1
Installing external-dns Helm chart
Hang tight while we grab the latest from your chart repositories...
...Skip local chart repository
...Successfully got an update from the "stable" chart repository
Update Complete. ⎈ Happy Helming!⎈
NAME:   dns-gitlab-review-app
LAST DEPLOYED: Tue Jun 11 09:26:35 2019
NAMESPACE: review-apps-ee
STATUS: DEPLOYED

RESOURCES:
==> v1beta1/Deployment
NAME                                DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
dns-gitlab-review-app-external-dns  1        1        1           0          0s

==> v1/Pod(related)
NAME                                                 READY  STATUS             RESTARTS  AGE
dns-gitlab-review-app-external-dns-5ccf5d76c7-f4swg  0/1    ContainerCreating  0         0s

==> v1/Secret
NAME                                TYPE    DATA  AGE
dns-gitlab-review-app-external-dns  Opaque  2     0s

==> v1/ServiceAccount
NAME                                SECRETS  AGE
dns-gitlab-review-app-external-dns  1        0s

==> v1beta1/ClusterRole
NAME                                AGE
dns-gitlab-review-app-external-dns  0s

==> v1beta1/ClusterRoleBinding
NAME                                AGE
dns-gitlab-review-app-external-dns  0s

==> v1/Service
NAME                                TYPE       CLUSTER-IP     EXTERNAL-IP  PORT(S)   AGE
dns-gitlab-review-app-external-dns  ClusterIP  10.63.248.135  <none>       7979/TCP  0s

NOTES:
To verify that external-dns has started, run:

  kubectl --namespace=review-apps-ee get pods -l "app=external-dns,release=dns-gitlab-review-app"
```
@WarheadsSE I think I've stabilized the Review Apps deployments now. I've created a new node pool of n1-standard-1 machines so that we avoid having too many pods mounting too many volumes on a single node. I don't think that will address 100% of the problem, but it should at least avoid it in most cases...
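For reference, creating such a pool looks roughly like this (a sketch; the pool name, cluster name, zone, and node count are assumptions):

```shell
# Create a pool of small machines so each node hosts fewer pods,
# and therefore fewer secret/ConfigMap tmpfs mounts
gcloud container node-pools create review-apps-small \
  --cluster=review-apps-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-1 \
  --num-nodes=10
```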
All Review Apps are currently properly deployed, and no pods are in a bad state.