GitLab webservice fails to start after node eviction and forced reschedule

Summary

We have encountered an issue where the GitLab webservice fails to recover after a node crashes and the pod is rescheduled. Specifically, the gitlab-webservice-default pod fails in the dependencies init container during initialization.

Steps to reproduce

Simulate a crash by stopping a worker node that hosts the gitlab-webservice-default pod, then wait for the pod to be rescheduled.

Configuration used

Here is the values.yaml file we are using to deploy.

global:
  hosts:
    domain: {{ record_name | default(domain_name,true) }}
    https: true

  ingress:
    # Don't create ingress objects; Istio doesn't use them
    enabled: false
    configureCertmanager: false
    # TODO use Istio cert manager

  serviceAccount:
    name: "gitlab-service-account"

  certificates:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/alpine-certificates
      tag: 20191127-r2

  kubectl:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/kubectl
      tag: 1.13.12

  busybox:
    image:
      repository: local.image.repo.com/library/busybox
      tag: latest

  psql:
    preparedStatements: true
    host: "{{ gitlab_global.psql.host }}"
    database: "{{ gitlab_global.psql.database }}"
    username: {{ gitlab_global.psql.username }}
    port: 5432
    password:
      secret: "{{ gitlab_global.psql.password.secret }}"
      key: "{{ gitlab_global.psql.password.key }}"
      # TODO mutual tls https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/doc/advanced/external-db/index.md

  grafana:
    enabled: false

redis:
  image:
    registry: local.image.repo.com
    repository: bitnami/redis
    tag: 6.0.9-debian-10-r0
  metrics:
    image:
      registry: local.image.repo.com
      repository: bitnami/redis-exporter
      tag: 1.12.1-debian-10-r11
  global:
    size: 8Gi

minio:
  image: local.image.repo.com/minio/minio
  imageTag: RELEASE.2017-12-28T01-21-00Z
  minioMc:
    image: local.image.repo.com/minio/mc
    tag: RELEASE.2018-07-13T00-53-22Z
  persistence:
    size: 10Gi

registry:
  image:
    repository: local.image.repo.com/gitlab-org/build/cng/gitlab-container-registry
    tag: v3.2.1-gitlab

certmanager:
  install: false

gitlab:
  gitlab-shell:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-shell
      tag: v13.17.0
    config:
      loginGraceTime: 60
  gitaly:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitaly
      tag: v13.10.3
    persistence:
      size: {{ gitlab.gitaly.persistence.size }}

  gitlab-exporter:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-exporter
      tag: 10.1.0

  sidekiq:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-sidekiq-ee
      tag: v13.10.3

  task-runner:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-task-runner-ee
      tag: v13.10.3

  webservice:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-webservice-ee
      tag: v13.10.3
    workhorse:
      image: local.image.repo.com/gitlab-org/build/cng/gitlab-workhorse-ee
      tag: v13.10.3
    service:
      externalPort: 443
    psql:
      password:
        secret: "{{ gitlab_global.psql.password.secret }}"
        key: {{ gitlab_global.psql.password.key }}

  migrations:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/gitlab-task-runner-ee
      tag: v13.10.3

shared-secrets:
  selfsign:
    image:
      repository: local.image.repo.com/gitlab-org/build/cng/cfssl-self-sign
      tag: 1.2

gitlab-runner:
  image: local.image.repo.com/gitlab/gitlab-runner:alpine-v13.9.0
  install: true
  certsSecretName: gitlab-wildcard-tls-chain
  checkInterval: 20
  concurrent: {{ gitlab_runner.concurrent }} # Max number of runners
  runners:
    tags: {{ gitlab_runner.runners.tags }}
    namespace: gitlab # Bricks itself if not in the same ns

# PostgreSQL is not installed with GitLab; it is provided externally
postgresql:
  install: false

# TODO connect to an external Prometheus
prometheus:
  install: false

# Ingress is controlled by Istio
nginx-ingress:
  enabled: false

Current behavior

The GitLab webservice pod fails to start after being rescheduled. The pod crashes during dependency initialization:

gitlab-webservice-default-78789c4b4b-299tq    0/2     Init:CrashLoopBackOff   73         6h33m
gitlab-webservice-default-78789c4b4b-bk2h2    1/2     CrashLoopBackOff        70         15h

Expected behavior

The GitLab webservice pod should be rescheduled and start without errors.

Versions

  • Chart: (tagged version | branch | hash git rev-parse HEAD)
  • Platform:
    • Self-hosted: Ansible/kubeadm-managed deployment on ESXi/vSphere
  • Kubernetes: (kubectl version)
    • Client: v1.20.4
[dad-user@bastion ~]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"archive", BuildDate:"2021-03-18T22:47:51Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"archive", BuildDate:"2021-03-18T09:40:40Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
    • Server: v1.20.4
  • Helm: (helm version)
    • Client: v3.3.1
[dad-user@bastion ~]$ helm version
version.BuildInfo{Version:"v3.3.1", GitCommit:"249e5215cde0c3fa72e27eb7a30e8d55c9696144", GitTreeState:"clean", GoVersion:"go1.14.7"}
    • Server: n/a

Relevant logs

The gitlab pods:

[dad-user@bastion ~]$ kubectl get pods -n gitlab
NAME                                          READY   STATUS                  RESTARTS   AGE
gitlab-db-postgresql-0                        1/1     Running                 0          6h32m
gitlab-gitaly-0                               1/1     Running                 0          15h
gitlab-gitlab-exporter-7664684f9-7q2r4        1/1     Running                 0          15h
gitlab-gitlab-runner-7674875b7-2njfw          1/1     Running                 0          15h
gitlab-gitlab-shell-6c76f487d5-fqxgq          1/1     Running                 0          15h
gitlab-gitlab-shell-6c76f487d5-nj7vz          1/1     Running                 0          15h
gitlab-import-job-z66hk                       0/1     Completed               0          15h
gitlab-migrations-1-t6h97                     0/1     Completed               0          15h
gitlab-minio-748d5fd989-s2p4m                 1/1     Running                 0          15h
gitlab-redis-master-0                         2/2     Running                 1          15h
gitlab-registry-6b6dc75c6-lsswt               1/1     Running                 0          6h37m
gitlab-registry-6b6dc75c6-qvs6c               1/1     Running                 0          15h
gitlab-sidekiq-all-in-1-v1-86699b6d96-5627j   1/1     Running                 0          15h
gitlab-task-runner-5947468b68-j7dl8           1/1     Running                 0          6h37m
gitlab-webservice-default-78789c4b4b-299tq    0/2     Init:CrashLoopBackOff   74         6h37m
gitlab-webservice-default-78789c4b4b-bk2h2    1/2     CrashLoopBackOff        71         15h
[dad-user@bastion ~]$ 

The pod events:

[dad-user@bastion ~]$ kubectl get events -n gitlab 
LAST SEEN   TYPE      REASON                    OBJECT                                               MESSAGE
105s        Warning   BackOff                   pod/gitlab-webservice-default-78789c4b4b-299tq       Back-off restarting failed container
4m44s       Warning   BackOff                   pod/gitlab-webservice-default-78789c4b4b-bk2h2       Back-off restarting failed container

The pod logs:

[dad-user@bastion ~]$ kubectl logs -n gitlab gitlab-webservice-default-78789c4b4b-299tq dependencies
+ /scripts/set-config /var/opt/gitlab/templates /srv/gitlab/config
Begin parsing .erb files from /var/opt/gitlab/templates
Writing /srv/gitlab/config/resque.yml
Writing /srv/gitlab/config/cable.yml
Writing /srv/gitlab/config/database.yml
Writing /srv/gitlab/config/gitlab.yml
Copying other config files found in /var/opt/gitlab/templates
Copying smtp_settings.rb into /srv/gitlab/config
+ exec /scripts/wait-for-deps
Checking: resque.yml, cable.yml
+ SUCCESS connecting to 'redis://gitlab-redis-master.gitlab.svc:6379' from cable.yml, through gitlab-redis-master.gitlab.svc
+ SUCCESS connecting to 'redis://gitlab-redis-master.gitlab.svc:6379' from resque.yml, through gitlab-redis-master.gitlab.svc
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
Database Schema - current: 0, codebase: 20210310111009
NOTICE: Database has not been initialized yet.
WARNING: Not all services were operational, with data migrations completed.
If this container continues to fail, please see: https://docs.gitlab.com/charts/troubleshooting/index.html#application-containers-constantly-initializing
[dad-user@bastion ~]$ 
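
The dependencies log above repeats the same schema check until the container gives up. As a rough aid for reading such output, here is a minimal Python sketch that counts the distinct (current, codebase) pairs in a saved log. It assumes the line format is exactly the one shown above. The names `summarize_schema_checks` and `is_uninitialized` are hypothetical helpers for illustration and are not part of the GitLab chart or its scripts.

```python
import re
from collections import Counter

# Matches the schema-check lines exactly as they appear in the log above.
SCHEMA_LINE = re.compile(r"^Database Schema - current: (\d+), codebase: (\d+)$")


def summarize_schema_checks(log_text):
    """Count each distinct (current, codebase) pair reported in the log."""
    pairs = Counter()
    for line in log_text.splitlines():
        match = SCHEMA_LINE.match(line.strip())
        if match:
            pairs[(int(match.group(1)), int(match.group(2)))] += 1
    return pairs


def is_uninitialized(pairs):
    """Return True if at least one check was seen and all reported version 0.

    Treating version 0 as "not initialized" is an assumption based on the
    NOTICE line that accompanies it in the log above.
    """
    return bool(pairs) and all(current == 0 for current, _ in pairs)


if __name__ == "__main__":
    sample = (
        "Database Schema - current: 0, codebase: 20210310111009\n"
        "NOTICE: Database has not been initialized yet.\n"
        "Database Schema - current: 0, codebase: 20210310111009\n"
        "NOTICE: Database has not been initialized yet.\n"
    )
    summary = summarize_schema_checks(sample)
    for (current, codebase), count in summary.items():
        print(f"current={current} codebase={codebase} seen {count} time(s)")
    print("database looks uninitialized:", is_uninitialized(summary))
```

One way to use it is to redirect the output of the kubectl logs command shown above into a file and pass the file's contents to `summarize_schema_checks`. A single pair with current 0 on every attempt, as in this report, suggests the database the pod connects to never had migrations applied, rather than a migration that is still in progress.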
Edited by Jonathan Hill