Some CI jobs hang, while others complete, and eventually none get picked up by the runner
Summary
Some CI jobs hang, while others complete, and eventually none get picked up by the runner.
Steps to reproduce
Start several jobs that use the following .gitlab-ci.yml code, which leads to jobs that hang, then retry any job that previously succeeded.
.gitlab-ci.yml
---
image: linuxserver/yq

filter_features_core:
  stage: build
  script:
    - wget --no-verbose --output-document "features.yml" "https://gitlab.com/gitlab-com/www-gitlab-com/-/raw/release-${CI_SERVER_VERSION_MAJOR}-${CI_SERVER_VERSION_MINOR}/data/features.yml"
    - yq eval '.features[] | select(.gitlab_core == true)' features.yml > version-${CI_SERVER_VERSION_MAJOR}-${CI_SERVER_VERSION_MINOR}-core_features.yml
  artifacts:
    paths:
      - version-${CI_SERVER_VERSION_MAJOR}-${CI_SERVER_VERSION_MINOR}-core_features.yml
By removing some of the code from .gitlab-ci.yml, I was able to get similar jobs in the same repo to run successfully at first:
diff for .gitlab-ci.yml
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
index 94416c8..569ee42 100644
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
@@ -1,23 +1,6 @@
 ---
-stages:
-  - build
-  - test
-
 yamllint:
   stage: test
   image: registry.gitlab.com/pipeline-components/yamllint:latest
   script:
     - yamllint --strict -f colored .
-
-filter_features_core:
-  stage: build
-  image: mikefarah/yq
-  script:
-    - wget --no-verbose --output-document "features.yml" "https://gitlab.com/gitlab-com/www-gitlab-com/-/raw/release-${CI_SERVER_VERSION_MAJOR}-${CI_SERVER_VERSION_MINOR}/data/features.yml"
-    - yq eval '.features[] | select(.gitlab_core == true)' features.yml > version-${CI_SERVER_VERSION_MAJOR}-${CI_SERVER_VERSION_MINOR}-core_features.yml
-
-  artifacts:
-    paths:
-      - version-${CI_SERVER_VERSION_MAJOR}-${CI_SERVER_VERSION_MINOR}-core_features.yml
-
-
(Although I'm not sure what the bug is in the above CI code, if there is one.)
Actual behavior
While jobs are still reaching the runner, the CI jobs with the unpatched .gitlab-ci.yml file get stuck and hang here:
Running with gitlab-runner 16.10.0 (81ab07f6)
on gitlab-runner-5cbc5b7447-lgqpn CR2k7x2e, system ID: r_Zjjjf0oWY9PN
Preparing the "kubernetes" executor 00:00
Using Kubernetes namespace: gitlab-managed-apps
Using Kubernetes executor with image linuxserver/yq ...
Using attach strategy to execute scripts...
Preparing environment
Using FF_USE_POD_ACTIVE_DEADLINE_SECONDS, the Pod activeDeadlineSeconds will be set to the job timeout: 1h0m0s...
Waiting for pod gitlab-managed-apps/runner-cr2k7x2e-project-2211-concurrent-0-w8st4mr6 to be running, status is Pending
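The last line of the job log names the namespace and pod that never leave Pending, which is what `kubectl describe pod` needs in order to show scheduling or image-pull events. The following is a minimal sketch, not part of the original report, that pulls those values out of a saved job log and prints the matching command. The helper names `pending_pods` and `describe_commands` are hypothetical, and the regular expression assumes the log wording shown above.

```python
import re

# Matches the runner's "Waiting for pod <namespace>/<pod> to be running, status is <phase>" line.
WAITING_RE = re.compile(
    r"Waiting for pod (?P<namespace>[^/\s]+)/(?P<pod>\S+) to be running, "
    r"status is (?P<phase>\w+)"
)


def pending_pods(job_log):
    """Return unique (namespace, pod) pairs that the job log reports as Pending."""
    seen = []
    for match in WAITING_RE.finditer(job_log):
        key = (match["namespace"], match["pod"])
        if match["phase"] == "Pending" and key not in seen:
            seen.append(key)
    return seen


def describe_commands(job_log):
    """Build one 'kubectl describe pod' command per Pending pod found in the log."""
    return [
        f"kubectl describe pod {pod} --namespace {namespace}"
        for namespace, pod in pending_pods(job_log)
    ]


# Sample line copied from the job log in this report.
sample = (
    "Waiting for pod gitlab-managed-apps/"
    "runner-cr2k7x2e-project-2211-concurrent-0-w8st4mr6 "
    "to be running, status is Pending"
)

for command in describe_commands(sample):
    print(command)
```

Running the printed command against the cluster while the job is still hanging should list the pod's events, which is usually where the reason for a Pending status shows up.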
Eventually, once enough stalled jobs have been started and cancelled, all jobs, including a re-run of a previously successful one, are stuck at:
This job has not started yet
This job is in pending state and is waiting to be picked by a runner
Expected behavior
When there is an issue with the .gitlab-ci.yml code, CI jobs should fail immediately rather than hang, and the runner should continue accepting jobs regardless of previously hung and cancelled CI jobs.
Relevant logs and/or screenshots
Looking at the runner's jobs info in the admin area, I see the previous jobs, and the page says that the runner was last contacted "just now", but the new jobs are not picked up by the runner, and they remain "pending".
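One thing that may be worth ruling out (this is only a hypothesis, not something the report confirms) is whether hung jobs keep occupying the runner's concurrency slots. The runner log quoted later in this report prints "Added job to processing list" and "Removed job from processing list" lines with `job=` and `max_builds=` fields, so a rough sketch like the one below can list jobs that were added but never removed. The function name `jobs_in_flight` is hypothetical, and the patterns assume the log format shown in this report.

```python
import re

# Patterns for the runner's processing-list bookkeeping lines.
ADDED_RE = re.compile(
    r"Added job to processing list\s.*?\bjob=(\d+)\b.*?\bmax_builds=(\d+)"
)
REMOVED_RE = re.compile(r"Removed job from processing list\s.*?\bjob=(\d+)\b")


def jobs_in_flight(log_text):
    """Return (job IDs added but never removed, last max_builds value seen)."""
    active = set()
    max_builds = None
    for line in log_text.splitlines():
        added = ADDED_RE.search(line)
        if added:
            active.add(added.group(1))
            max_builds = int(added.group(2))
            continue
        removed = REMOVED_RE.search(line)
        if removed:
            active.discard(removed.group(1))
    return active, max_builds


# Lines copied (and shortened) from the kubectl logs output in this report.
sample_log = """\
Added job to processing list builds=1 job=7441 max_builds=10 project=2169
Added job to processing list builds=2 job=7440 max_builds=10 project=239
Removed job from processing list builds=1 job=7441 max_builds=10 project=2169
"""

active, limit = jobs_in_flight(sample_log)
print(f"{len(active)} of {limit} slots still held by jobs: {sorted(active)}")
```

If the number of jobs still held approaches `max_builds` after several cancelled runs, that would point towards slot exhaustion; if not, the cause lies elsewhere.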
Yesterday, while looking at the Kubernetes pod logs for a runner by using the kubectl logs ... command, I saw the following:
kubectl logs
Registration attempt 1 of 30
Runtime platform arch=amd64 os=linux pid=14 revision=81ab07f6 version=16.10.0
WARNING: Running in user-mode.
WARNING: The user-mode requires you to manually start builds processing:
WARNING: $ gitlab-runner run
WARNING: Use sudo for system-mode:
WARNING: $ sudo gitlab-runner...
Merging configuration from template file "/configmaps/config.template.toml"
WARNING: Support for registration tokens and runner parameters in the 'register' command has been deprecated in GitLab Runner 15.6 and will be replaced with support for authentication tokens. For more information, see https://docs.gitlab.com/ee/ci/runners/new_creation_workflow
Registering runner... succeeded runner=tBaGMUfT
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!
Configuration (with the authentication token) was saved in "/home/gitlab-runner/.gitlab-runner/config.toml"
Runtime platform arch=amd64 os=linux pid=7 revision=81ab07f6 version=16.10.0
Starting multi-runner from /home/gitlab-runner/.gitlab-runner/config.toml... builds=0 max_builds=0
WARNING: Running in user-mode.
WARNING: Use sudo for system-mode:
WARNING: $ sudo gitlab-runner...
There might be a problem with your config based on jsonschema annotations in common/config.go (experimental feature):
jsonschema: '/runners/0/Monitoring' does not validate with https://gitlab.com/gitlab-org/gitlab-runner/common/config#/$ref/properties/runners/items/$ref/properties/Monitoring/$ref/type: expected object, but got null
Configuration loaded builds=0 max_builds=10
Metrics server listening address=:9252 builds=0 max_builds=10
[session_server].listen_address not defined, session endpoints disabled builds=0 max_builds=10
Initializing executor providers builds=0 max_builds=10
Checking for jobs... received job=7441 repo_url=https://example.com/leaf-node/pages-test.git runner=CR2k7x2e
Added job to processing list builds=1 job=7441 max_builds=10 project=2169 repo_url=https://example.com/leaf-node/pages-test.git time_in_queue_seconds=2873
Checking for jobs... received job=7440 repo_url=https://example.com/infra/playbooks.git runner=CR2k7x2e
Added job to processing list builds=2 job=7440 max_builds=10 project=239 repo_url=https://example.com/infra/playbooks.git time_in_queue_seconds=2968
Appending trace to coordinator...ok code=202 job=7441 job-log=0-781 job-status=running runner=CR2k7x2e sent-log=0-780 status=202 Accepted update-interval=1m0s
...
Appending trace to coordinator...ok code=202 job=7440 job-log=0-784 job-status=running runner=CR2k7x2e sent-log=0-783 status=202 Accepted update-interval=1m0s
Job succeeded duration_s=6.903232934 job=7441 project=2169 runner=CR2k7x2e
Appending trace to coordinator...ok code=202 job=7441 job-log=0-2237 job-status=running runner=CR2k7x2e sent-log=781-2236 status=202 Accepted update-interval=1m0s
Updating job... bytesize=2237 checksum=crc32:916ecf33 job=7441 runner=CR2k7x2e
Submitting job to coordinator...ok bytesize=2237 checksum=crc32:916ecf33 code=200 job=7441 job-status= runner=CR2k7x2e update-interval=0s
Removed job from processing list builds=1 job=7441 max_builds=10 project=2169 repo_url=https://example.com/leaf-node/pages-test.git time_in_queue_seconds=2873
Job succeeded duration_s=10.38298994 job=7440 project=239 runner=CR2k7x2e
WARNING: Appending trace to coordinator... failed code=200 job=7440 job-log= job-status= runner=CR2k7x2e sent-log=784-2908 status=200 OK update-interval=0s
WARNING: Appending trace to coordinator... failed code=200 job=7440 job-log= job-status= runner=CR2k7x2e sent-log=784-2908 status=200 OK update-interval=0s
WARNING: Appending trace to coordinator... failed code=200 job=7440 job-log= job-status= runner=CR2k7x2e sent-log=784-2908 status=200 OK update-interval=0s
WARNING: Appending trace to coordinator... failed code=200 job=7440 job-log= job-status= runner=CR2k7x2e sent-log=784-2908 status=200 OK update-interval=0s
WARNING: Appending trace to coordinator... failed code=200 job=7440 job-log= job-status= runner=CR2k7x2e sent-log=784-2908 status=20
The following was repeated many times, scrolling off of the screen:
WARNING: Appending trace to coordinator... failed code=200 job=7440 job-log= job-status= runner=CR2k7x2e sent-log=784-2908 status=200 OK update-interval=0s
There are issues for similar problems, but none that say both failed and code=200 together, which seems a bit paradoxical.
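Because the warning scrolls off the screen, it can help to count the occurrences instead of reading them. Below is a small sketch, under the assumption that the runner log has been saved to a string or file, that counts per job the lines reporting both "failed" and code=200. The function name `paradoxical_failures` is hypothetical, and the key=value parsing is a simplification of the runner's log format.

```python
import re
from collections import Counter

# Captures key=value pairs such as code=200 or job-log= (values may be empty).
FIELD_RE = re.compile(r"(\w[\w-]*)=(\S*)")


def paradoxical_failures(log_text):
    """Count, per job ID, trace-append lines that say 'failed' yet carry code=200."""
    counts = Counter()
    for line in log_text.splitlines():
        if "Appending trace to coordinator... failed" not in line:
            continue
        fields = dict(FIELD_RE.findall(line))
        if fields.get("code") == "200":
            counts[fields.get("job", "unknown")] += 1
    return counts


# Lines copied from the kubectl logs output in this report.
warning = (
    "WARNING: Appending trace to coordinator... failed code=200 job=7440 "
    "job-log= job-status= runner=CR2k7x2e sent-log=784-2908 status=200 OK "
    "update-interval=0s"
)
ok_line = (
    "Appending trace to coordinator...ok code=202 job=7441 job-log=0-781 "
    "job-status=running runner=CR2k7x2e sent-log=0-780 status=202 Accepted "
    "update-interval=1m0s"
)
sample_log = "\n".join([ok_line, warning, warning, warning])

for job, count in paradoxical_failures(sample_log).items():
    print(f"job {job}: {count} failed trace updates with code=200")
```

A per-job count makes it easier to tell whether the warning is tied to one job that finished (here 7440) or affects every job the runner handles.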
Environment description
We are running a self-hosted GitLab 16.10.4 instance using Omnibus, and a GitLab 16.10.0 runner on Kubernetes via the Helm chart. We're using EKS (AWS Kubernetes) to host the runners, and EC2 instances to host GitLab. We have one dedicated runner that handles all untagged jobs. We're using Helm version 3.14.4 and kubectl version 1.29/stable.
/etc/gitlab/values.yml contents
---
runnerRegistrationToken: "{{ runner_token.secret }}"
gitlabUrl: "{{ api_external_url.secret }}"
replicas: 1
rbac:
  create: true
  ## Define specific rbac permissions.
  # resources: ["pods", "pods/exec", "secrets"]
  # verbs: ["get", "list", "watch", "create", "patch", "delete"]

  ## Run the gitlab-bastion container with the ability to deploy/manage containers of jobs
  ## cluster-wide or only within namespace
  clusterWideAccess: true

  ## Use the following Kubernetes Service Account name if RBAC is disabled in this Helm chart (see rbac.create)
  ##
  # serviceAccountName: default

## Configure integrated Prometheus metrics exporter
## ref: https://docs.gitlab.com/runner/monitoring/#configuration-of-the-metrics-http-server
metrics:
  enabled: true
Used GitLab Runner version
Running with gitlab-runner 16.10.0 (81ab07f6)
on gitlab-runner-5cbc5b7447-lgqpn CR2k7x2e, system ID: r_Zjjjf0oWY9PN