GitLab webservice fails to start after node eviction and forced reschedule
Summary
We have encountered an issue where the GitLab webservice fails to recover after a node crashes and its pod is rescheduled. Specifically, the gitlab-webservice-default pod fails while its dependencies init container is running.
Steps to reproduce
Simulate a node crash by stopping the worker node that hosts the gitlab-webservice-default pod, then wait for the pod to be rescheduled onto another node.
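To pick the node to stop, it helps to list the nodes that currently host the webservice pods. The sketch below is a hypothetical helper (webservice_nodes is not part of the chart). It assumes the pods carry the label app=webservice and defaults to the gitlab namespace used in this report.

```shell
# Hypothetical helper: print each node that runs a webservice pod, once.
# Assumption: webservice pods are labelled app=webservice; adjust if not.
webservice_nodes() {
  local namespace="${1:-gitlab}"   # namespace used throughout this report
  kubectl get pods -n "$namespace" -l app=webservice \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' |
    awk 'NF && !seen[$0]++'        # drop blank lines (unscheduled pods) and duplicates
}
```

After powering off one of the printed nodes (for example from vSphere), `kubectl get pods -n gitlab -w` shows the pod being rescheduled.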
Configuration used
Here is the values.yaml file we use for the deployment. It is rendered by Ansible, so the {{ }} expressions are Jinja2 template variables:
```yaml
global:
  hosts:
    domain: {{ record_name | default(domain_name,true) }}
    https: true
  ingress:
    # Don't create ingress objects, Istio doesn't use them
    enabled: false
    configureCertmanager: false
    # TODO use Istio cert manager
  serviceAccount:
    name: "gitlab-service-account"
  certificates:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/alpine-certificates
      tag: 20191127-r2
  kubectl:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/kubectl
      tag: 1.13.12
  busybox:
    image:
      repository: local.image.repo.com/library/busybox
      tag: latest
  psql:
    preparedStatements: true
    host: "{{ gitlab_global.psql.host }}"
    database: "{{ gitlab_global.psql.database }}"
    username: {{ gitlab_global.psql.username }}
    port: 5432
    password:
      secret: "{{ gitlab_global.psql.password.secret }}"
      key: "{{ gitlab_global.psql.password.key }}"
    # TODO mutual tls https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/doc/advanced/external-db/index.md
  grafana:
    enabled: false
redis:
  image:
    registry: local.image.repo.com
    repository: bitnami/redis
    tag: 6.0.9-debian-10-r0
  metrics:
    image:
      registry: local.image.repo.com
      repository: bitnami/redis-exporter
      tag: 1.12.1-debian-10-r11
  global:
    size: 8Gi
minio:
  image: local.image.repo.com/minio/minio
  imageTag: RELEASE.2017-12-28T01-21-00Z
  minioMc:
    image: local.image.repo.com/minio/mc
    tag: RELEASE.2018-07-13T00-53-22Z
  persistence:
    size: 10Gi
registry:
  image:
    repository: local.image.repo.com/gitlab-org/build/cng/gitlab-container-registry
    tag: v3.2.1-gitlab
certmanager:
  install: false
gitlab:
  gitlab-shell:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-shell
      tag: v13.17.0
    config:
      loginGraceTime: 60
  gitaly:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitaly
      tag: v13.10.3
    persistence:
      size: {{ gitlab.gitaly.persistence.size }}
  gitlab-exporter:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-exporter
      tag: 10.1.0
  sidekiq:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-sidekiq-ee
      tag: v13.10.3
  task-runner:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-task-runner-ee
      tag: v13.10.3
  webservice:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-webservice-ee
      tag: v13.10.3
    workhorse:
      image: local.image.repo.com/gitlab-org/build/cng/gitlab-workhorse-ee
      tag: v13.10.3
    service:
      externalPort: 443
    psql:
      password:
        secret: "{{ gitlab_global.psql.password.secret }}"
        key: {{ gitlab_global.psql.password.key }}
  migrations:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-task-runner-ee
      tag: v13.10.3
shared-secrets:
  selfsign:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/cfssl-self-sign
      tag: 1.2
gitlab-runner:
  image: local.image.repo.com/gitlab/gitlab-runner:alpine-v13.9.0
  install: true
  certsSecretName: gitlab-wildcard-tls-chain
  checkInterval: 20
  concurrent: {{ gitlab_runner.concurrent }} # Max number of runners
  runners:
    tags: {{ gitlab_runner.runners.tags }}
    namespace: gitlab # Bricks itself if not in the same ns
# Postgres is not included with GitLab; it is provided externally
postgresql:
  install: false
# TODO add connect to external prometheus
prometheus:
  install: false
# Ingress is controlled by Istio
nginx-ingress:
  enabled: false
```
Current behavior
After the reschedule, the gitlab-webservice-default pods fail to start. The pod crashes during dependency initialization:
```
gitlab-webservice-default-78789c4b4b-299tq   0/2     Init:CrashLoopBackOff   73         6h33m
gitlab-webservice-default-78789c4b4b-bk2h2   1/2     CrashLoopBackOff        70         15h
```
Expected behavior
The GitLab webservice pod should be rescheduled and start without errors.
Versions
- Chart: (tagged version | branch | hash git rev-parse HEAD)
- Platform:
  - Self-hosted: Ansible/kubeadm-managed deployment on ESXi/vSphere
- Kubernetes: (kubectl version)
  - Client: v1.20.4
  - Server: v1.20.4
- Helm: (helm version)
  - Client: v3.3.1
  - Server: n/a

```
[dad-user@bastion ~]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"archive", BuildDate:"2021-03-18T22:47:51Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"archive", BuildDate:"2021-03-18T09:40:40Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
[dad-user@bastion ~]$ helm version
version.BuildInfo{Version:"v3.3.1", GitCommit:"249e5215cde0c3fa72e27eb7a30e8d55c9696144", GitTreeState:"clean", GoVersion:"go1.14.7"}
```
Relevant logs
The GitLab pods:
```
[dad-user@bastion ~]$ kubectl get pods -n gitlab
NAME                                          READY   STATUS                  RESTARTS   AGE
gitlab-db-postgresql-0                        1/1     Running                 0          6h32m
gitlab-gitaly-0                               1/1     Running                 0          15h
gitlab-gitlab-exporter-7664684f9-7q2r4        1/1     Running                 0          15h
gitlab-gitlab-runner-7674875b7-2njfw          1/1     Running                 0          15h
gitlab-gitlab-shell-6c76f487d5-fqxgq          1/1     Running                 0          15h
gitlab-gitlab-shell-6c76f487d5-nj7vz          1/1     Running                 0          15h
gitlab-import-job-z66hk                       0/1     Completed               0          15h
gitlab-migrations-1-t6h97                     0/1     Completed               0          15h
gitlab-minio-748d5fd989-s2p4m                 1/1     Running                 0          15h
gitlab-redis-master-0                         2/2     Running                 1          15h
gitlab-registry-6b6dc75c6-lsswt               1/1     Running                 0          6h37m
gitlab-registry-6b6dc75c6-qvs6c               1/1     Running                 0          15h
gitlab-sidekiq-all-in-1-v1-86699b6d96-5627j   1/1     Running                 0          15h
gitlab-task-runner-5947468b68-j7dl8           1/1     Running                 0          6h37m
gitlab-webservice-default-78789c4b4b-299tq    0/2     Init:CrashLoopBackOff   74         6h37m
gitlab-webservice-default-78789c4b4b-bk2h2    1/2     CrashLoopBackOff        71         15h
```
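When the listing is long, a small filter can pick out the pods that are neither fully ready nor finished. This is a hypothetical helper (unhealthy_pods is not a kubectl feature) that reads the plain-text table printed by kubectl get pods on standard input and relies only on the NAME, READY, and STATUS columns shown above.

```shell
# Hypothetical filter: print NAME and STATUS for every pod whose READY
# count is incomplete (e.g. "0/2"), skipping finished Jobs (STATUS Completed).
unhealthy_pods() {
  awk 'NR > 1 {                      # skip the header row
    split($2, ready, "/")            # READY column: ready[1] of ready[2] containers
    if (ready[1] != ready[2] && $3 != "Completed")
      print $1, $3
  }'
}
```

Used as `kubectl get pods -n gitlab | unhealthy_pods`, it prints only the two gitlab-webservice-default pods for the listing above.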
The pod events:
```
[dad-user@bastion ~]$ kubectl get events -n gitlab
LAST SEEN   TYPE      REASON    OBJECT                                           MESSAGE
105s        Warning   BackOff   pod/gitlab-webservice-default-78789c4b4b-299tq   Back-off restarting failed container
4m44s       Warning   BackOff   pod/gitlab-webservice-default-78789c4b4b-bk2h2   Back-off restarting failed container
```
The pod logs:
```
[dad-user@bastion ~]$ kubectl logs -n gitlab gitlab-webservice-default-78789c4b4b-299tq dependencies
+ /scripts/set-config /var/opt/gitlab/templates /srv/gitlab/config
Begin parsing .erb files from /var/opt/gitlab/templates
Writing /srv/gitlab/config/resque.yml
Writing /srv/gitlab/config/cable.yml
Writing /srv/gitlab/config/database.yml
Writing /srv/gitlab/config/gitlab.yml
Copying other config files found in /var/opt/gitlab/templates
Copying smtp_settings.rb into /srv/gitlab/config
+ exec /scripts/wait-for-deps
Checking: resque.yml, cable.yml
+ SUCCESS connecting to 'redis://gitlab-redis-master.gitlab.svc:6379' from cable.yml, through gitlab-redis-master.gitlab.svc
+ SUCCESS connecting to 'redis://gitlab-redis-master.gitlab.svc:6379' from resque.yml, through gitlab-redis-master.gitlab.svc
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
WARNING: Not all services were operational, with data migrations completed.
If this container continues to fail, please see: https://docs.gitlab.com/charts/troubleshooting/index.html#application-containers-constantly-initializing
```
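The log repeats the same pair of lines until the container gives up. The hypothetical helper below (schema_check_summary is not part of the chart or of wait-for-deps) condenses that output into the number of schema checks and the last reported versions. It assumes only the "Database Schema - current: X, codebase: Y" line format visible in the log above, not any particular implementation of the dependency check.

```shell
# Hypothetical summary of a dependencies-container log read from stdin.
# Assumes lines shaped like: "Database Schema - current: 0, codebase: 20210310111009"
schema_check_summary() {
  awk '
    /^Database Schema - current:/ {
      checks++
      current = $5; sub(/,$/, "", current)   # "0," -> "0"
      codebase = $7
    }
    END {
      if (checks == 0) { print "no schema checks found"; exit 1 }
      printf "checks=%d current=%s codebase=%s\n", checks, current, codebase
      if (current == 0) print "database schema is empty (not initialized)"
      else if (current != codebase) print "database schema differs from the codebase"
      else print "database schema matches the codebase"
    }'
}
```

Piping the command above into it (`kubectl logs -n gitlab gitlab-webservice-default-78789c4b4b-299tq dependencies | schema_check_summary`) reports checks=29 current=0 codebase=20210310111009, followed by the "not initialized" line.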
Edited by Jonathan Hill