GitLab Runner not able to resolve GitLab URL - Kubernetes Executor
### Status update: 2021-05-13

- We believe the root cause is the Alpine DNS issue affecting the runner-helper image.
- This [MR](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/2835) is in flight to provide an Ubuntu flavor of the runner-helper image and is at the maintainer review stage. Eventually, users should be able to set `helper_image_flavor = "ubuntu"` in their runner TOML config, which we hope will make this issue disappear.

### Summary

GitLab Runner is not able to resolve the GitLab URL. When GitLab and GitLab Runner are deployed in a Kubernetes cluster, the runner intermittently fails with the following:

```
Running with gitlab-runner 10.8.0 (079aad9e)
  on big-runner-gitlab-runner-7bd85f8f65-qqvpj f533f16d
Using Kubernetes namespace: gitlab-runner
Using Kubernetes executor with image $IMAGE ...
Waiting for pod gitlab-runner/runner-f533f16d-project-38-concurrent-4ttnck to be running, status is Pending
Waiting for pod gitlab-runner/runner-f533f16d-project-38-concurrent-4ttnck to be running, status is Pending
Waiting for pod gitlab-runner/runner-f533f16d-project-38-concurrent-4ttnck to be running, status is Pending
Running on runner-f533f16d-project-38-concurrent-4ttnck via big-runner-gitlab-runner-7bd85f8f65-qqvpj...
Cloning repository for master with git depth set to 20...
Cloning into '/repo/proj'...
fatal: unable to access 'https://gitlab-ci-token:xxxxxxxxxxxxxxxxxxxx@code.repo.io/repo/proj.git/': Could not resolve host: code.repo.io
/bin/bash: line 114: cd: /repo/proj: No such file or directory
ERROR: Job failed: error executing remote command: command terminated with non-zero exit code: Error executing in Docker Container: 1
```

Most of the time the runner resolves the domain name and the job continues successfully. Sometimes, however, it fails with the error above, and retrying these jobs has become quite a pain.

How can I debug this? So far I have tried the following:

- Debugged Kubernetes DNS: I created a pod every minute and tried to resolve the URL from it. Resolution always succeeded.
- Checked whether my DNS server (AWS) is throttling queries: I was able to query at 1000 QPS without a failure.
- Added a trailing dot to the URL (`https://code.repo.io.`), as mentioned [here](https://gitlab.com/gitlab-org/gitlab-runner/issues/2847). It didn't help.

### Steps to reproduce

1. Deploy GitLab on Kubernetes via the Helm chart: [gitlab](https://gitlab.com/charts/gitlab-rails)
2. Deploy GitLab Runner on Kubernetes via the Helm chart: [gitlab-runner](https://gitlab.com/charts/gitlab-runner.git)
3. Create a project in GitLab with a `.gitlab-ci.yml` file.
4. Run a job in CI/CD. The clone fails intermittently.

### What is the current *bug* behavior?

The runner job fails with:

```
fatal: unable to access 'https://gitlab-ci-token:xxxxxxxxxxxxxxxxxxxx@code.repo.io/repo/proj.git/': Could not resolve host: code.repo.io
```

### What is the expected *correct* behavior?

The clone shouldn't fail because of a DNS resolution error.

## Possible workarounds

- https://gitlab.com/gitlab-org/gitlab-foss/issues/47283#note_155177691
- https://gitlab.com/gitlab-org/gitlab-ce/issues/47283#note_164716469
- https://gitlab.com/gitlab-org/gitlab-runner/issues/4129#note_171501624
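For anyone trying to reproduce the intermittent failure, the manual DNS check described in the summary (resolve the host repeatedly and see whether any lookup fails) could be automated roughly as below. This is a minimal sketch, not part of the runner: the names `probe_dns` and `ProbeResult` are hypothetical, and the attempt count and interval are arbitrary. Since the suspected cause is specific to the Alpine-based runner-helper, the probe would probably need to run inside an Alpine-based container to be representative. That is an assumption, and it may explain why probes from other pods always resolved.

```python
import socket
import time
from dataclasses import dataclass, field


@dataclass
class ProbeResult:
    """Outcome of a series of DNS lookups (hypothetical helper type)."""
    attempts: int = 0
    failures: int = 0
    errors: list = field(default_factory=list)  # (attempt index, error text)

    @property
    def failure_rate(self) -> float:
        return self.failures / self.attempts if self.attempts else 0.0


def probe_dns(host, attempts=100, interval=0.1,
              resolve=socket.getaddrinfo, sleep=time.sleep):
    """Resolve `host` repeatedly and count lookups that raise socket.gaierror.

    `resolve` and `sleep` are injectable so the loop can be exercised
    without real network access.
    """
    result = ProbeResult()
    for i in range(attempts):
        result.attempts += 1
        try:
            # getaddrinfo goes through the system (libc) resolver,
            # the same path a git clone over HTTPS would normally use.
            resolve(host, 443)
        except socket.gaierror as exc:
            result.failures += 1
            result.errors.append((i, str(exc)))
        # Pause between lookups, but not after the last one.
        if interval and i < attempts - 1:
            sleep(interval)
    return result
```

Calling `probe_dns("code.repo.io", attempts=600, interval=0.1)` from a shell inside the build pod and inspecting `failures` and `errors` would show whether lookups fail sporadically; `code.repo.io` is the placeholder host from the log above.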