Multi-arch (x86_64 + aarch64) Kubernetes cluster: jobs start but fail during preparation on aarch64 nodes

Summary

When running CI jobs for the AARCH64 (ARM64) architecture on my multi-arch microk8s Kubernetes cluster (x86_64 + aarch64), with gitlab-runners deployed via the Helm 3 chart, I get the following failure:

Running with gitlab-runner 13.7.0 (943fc252)
  on gitlab-runner-arm64-ubuntu-20-10-gitlab-runner-699b99df6-kcnl2 59_sUxoj
Preparing the "kubernetes" executor
00:00
Using Kubernetes namespace: gitlab-runners
Using Kubernetes executor with image gitlab.cachegrand.dev:5050/cachegrand/cachegrand-server/ubuntu-2004-gcc:latest ...
Preparing environment
00:10
Waiting for pod gitlab-runners/runner-59suxoj-project-1-concurrent-0gd7b2 to be running, status is Pending
Waiting for pod gitlab-runners/runner-59suxoj-project-1-concurrent-0gd7b2 to be running, status is Pending
	ContainersNotReady: "containers with unready status: [build helper]"
	ContainersNotReady: "containers with unready status: [build helper]"
ERROR: Job failed (system failure): prepare environment: unable to upgrade connection: container not found ("helper"). Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

The CI jobs are mapped to the runners via tags, and the runners are mapped to the k8s nodes with the required architecture via labels.

Steps to reproduce

  • deploy a multi node microk8s cluster with x86_64 and aarch64 nodes
  • deploy a gitlab-runner with helm chart using the template pointed out below
  • trigger a CI job
.gitlab-ci.yml
build_ubuntu2004_arm64_gcc9_debug:
  stage: build
  script:
    - cmake -B build -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} -DUSE_HASHTABLE_HASH_ALGORITHM_T1HA2=1 -DBUILD_TESTS=1 -DBUILD_INTERNAL_BENCHES=1
    - cmake --build build -- -j $(nproc)
  artifacts:
    paths:
      - build
  cache:
    key: "$CI_COMMIT_REF_SLUG"
    paths:
      - build/
    untracked: true
  tags:
    - ubuntu-2004
    - arm64

Actual behavior

The job fails with the error shown in the Summary.

Expected behavior

The job should start.

Relevant logs and/or screenshots

Nothing useful from the logs.

The actual runner is destroyed too quickly, and I don't have a logging infrastructure configured for this Kubernetes cluster because it is for internal and personal use and the master runs on a Pi 4.

The gitlab-runner pod doesn't produce any useful log information, even in verbose mode. The log printed in the UI (shown in the Summary) is the only somewhat useful one I get.


Environment description

I am configuring a microk8s-based multi-arch Kubernetes cluster composed of some x86_64 machines, some Raspberry Pi 4 boards running a 64-bit kernel in aarch64 mode, and, in the near future, some aarch64 VMs (connected to the cluster via a WireGuard VPN).

As my goal is, among other things, to run gitlab-runners that execute CI jobs seamlessly across different architectures and machines/VMs with different specs, the combination of a distributed multi-arch Kubernetes cluster and Helm-deployed gitlab-runners looked like the best and most flexible option. As it is a personal cluster:

  • RBAC and HA are not in use
  • the master runs on a Raspberry Pi 4 8GB with Ubuntu 20.10 aarch64 and 5.8.0-1010-raspi as kernel
  • the cluster boots via PXE and TFTP and the rootfs is exported via NFS; the actual NFS-exported folders are a stack of layers exposed via overlayfs
  • containerd has been configured to run on the Pi 4 worker nodes using a USB 3.0 NVMe disk (ext4), as there are missing features when overlayfs runs over NFS

The gitlab-runners are deployed to the nodes with the proper architecture using node selector settings, and the runners expose tags that map them to the pipelines.
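
To illustrate how the node selector narrows placement, here is a minimal Python sketch of the matching rule: a pod is eligible for a node only when every key/value pair in the selector is present in the node's labels. The selector values are taken from the values file in this report; the node names and their label sets are hypothetical.

```python
from typing import Dict, List, Mapping


def matches(node_labels: Mapping[str, str], selector: Mapping[str, str]) -> bool:
    """Return True only if every selector key/value pair is present on the node."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def eligible_nodes(nodes: Mapping[str, Mapping[str, str]],
                   selector: Mapping[str, str]) -> List[str]:
    """Return the sorted names of the nodes whose labels satisfy the selector."""
    return sorted(name for name, labels in nodes.items() if matches(labels, selector))


# Selector copied from [runners.kubernetes.node_selector] in this report.
ARM64_SELECTOR: Dict[str, str] = {
    "kubernetes.io/node-role": "gitlab-runner",
    "kubernetes.io/arch": "arm64",
    "kubernetes.io/os": "linux",
}

# Hypothetical node inventory, for illustration only.
NODES: Dict[str, Dict[str, str]] = {
    "x86-worker-1": {
        "kubernetes.io/node-role": "gitlab-runner",
        "kubernetes.io/arch": "amd64",
        "kubernetes.io/os": "linux",
    },
    "pi4-worker-1": {
        "kubernetes.io/node-role": "gitlab-runner",
        "kubernetes.io/arch": "arm64",
        "kubernetes.io/os": "linux",
    },
    # arm64 node without the node-role label: not eligible.
    "pi4-master": {
        "kubernetes.io/arch": "arm64",
        "kubernetes.io/os": "linux",
    },
}

print(eligible_nodes(NODES, ARM64_SELECTOR))  # ['pi4-worker-1']
```

Note that the selector only decides where the pod lands. It does not change which image architecture is pulled for the containers in that pod.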

GitLab itself is running on a DigitalOcean droplet and reaches the Kubernetes cluster via a WireGuard VPN.

When I start any job that requires an AARCH64 (ARM64) gitlab-runner I get the error posted above.

The image used by the pipeline (.gitlab-ci.yml) is built with docker buildx and is multi-arch. I am using the Docker registry provided by my gitlab-ce instance.
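
One way to confirm that an image really is multi-arch is to look at its manifest list, which carries one entry per platform. The sketch below assumes the JSON layout of a Docker manifest list (a top-level `manifests` array whose entries have a `platform` object with `os` and `architecture`), as printed by `docker manifest inspect`. The two sample documents are hypothetical and trimmed to the relevant fields.

```python
import json
from typing import List


def platforms(manifest_json: str) -> List[str]:
    """List the "os/architecture" pairs advertised by a manifest list.

    A single-arch image manifest has no "manifests" array, so the result
    is an empty list in that case.
    """
    doc = json.loads(manifest_json)
    return sorted(
        "{}/{}".format(entry["platform"]["os"], entry["platform"]["architecture"])
        for entry in doc.get("manifests", [])
        if "platform" in entry
    )


# Hypothetical, trimmed manifest list for a multi-arch image.
MULTI_ARCH = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"platform": {"architecture": "arm64", "os": "linux"}},
        {"platform": {"architecture": "amd64", "os": "linux"}},
    ],
})

# Hypothetical, trimmed manifest for a single-arch image.
SINGLE_ARCH = json.dumps({"schemaVersion": 2, "config": {}, "layers": []})

print(platforms(MULTI_ARCH))   # ['linux/amd64', 'linux/arm64']
print(platforms(SINGLE_ARCH))  # []
```

An empty result means the tag points at a single-arch image, so its architecture has to be checked another way.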

There is no connectivity issue between GitLab and the gitlab-runners. In fact:

  • the gitlab-runners properly register
  • running a CI job on the gitlab-runner bound to the x86_64 machines works properly
  • manually deploying a gitlab-runner with shell-executor on a raspberry pi4 of the cluster works properly as well
  • I also tried to rebuild the image used by my gitlab-ci.yml directly on the raspberry pi 4 to avoid docker buildx but got the same result
helm-gitlab-runner-arm64-values.yml contents
concurrent: 5
rbac:
  create: true
image: gitlab/gitlab-runner:ubuntu
securityContext:
  fsGroup: 999
  runAsUser: 999
runners:
  privileged: true
  secret: gitlab-runner-secret
  tags: manycore,ubuntu,ubuntu-2004,arm64
  config: |
    concurrent = 5
    [[runners]]
      name = "arm64-ubuntu-20.04"
      environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1"]
      [runners.kubernetes]
        image = "ubuntu:20.04"
        image_pull_secrets = ["gitlab-cachegrand-dev-docker-registry-secret"]
        helper_image = "gitlab/gitlab-runner-helper:arm64-latest"
        allow_privilege_escalation = true
        privileged = true
        dns_policy = "cluster-first-with-host-net"
        [runners.kubernetes.node_selector]
          "kubernetes.io/node-role" = "gitlab-runner"
          "kubernetes.io/arch" = "arm64"
          "kubernetes.io/os" = "linux"
gitlabUrl: https://gitlab.cachegrand.dev/
privileged: true
nodeSelector:
  kubernetes.io/node-role: "gitlab-runner"
  kubernetes.io/os: "linux"
  kubernetes.io/arch: "arm64"
config.toml contents (from the gitlab-runner pod)
listen_address = ":9252"
concurrent = 5
check_interval = 30
log_level = "info"

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-runner-arm64-ubuntu-20-10-gitlab-runner-5b6bc778f-fq2hl"
  request_concurrency = 1
  url = "https://gitlab.cachegrand.dev/"
  token = "__HIDDEN__"
  executor = "kubernetes"
  environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1"]
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = "ubuntu:20.04"
    namespace = "gitlab-runners"
    namespace_overwrite_allowed = ""
    privileged = true
    allow_privilege_escalation = true
    image_pull_secrets = ["gitlab-cachegrand-dev-docker-registry-secret"]
    helper_image = "gitlab/gitlab-runner-helper:arm64-latest"
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    dns_policy = "cluster-first-with-host-net"
    [runners.kubernetes.node_selector]
      "kubernetes.io/arch" = "arm64"
      "kubernetes.io/node-role" = "gitlab-runner"
      "kubernetes.io/os" = "linux"
    [runners.kubernetes.affinity]
    [runners.kubernetes.pod_security_context]
    [runners.kubernetes.volumes]
    [runners.kubernetes.dns_config]

Used GitLab Runner version

Running with gitlab-runner 13.7.0 (943fc252)
  on gitlab-runner-arm64-ubuntu-20-10-gitlab-runner-699b99df6-kcnl2 59_sUxoj
Preparing the "kubernetes" executor
00:00
Using Kubernetes namespace: gitlab-runners
Using Kubernetes executor with image gitlab.cachegrand.dev:5050/cachegrand/cachegrand-server/ubuntu-2004-gcc:latest ...

Possible fixes

I fixed the issue by explicitly specifying the helper image to use (see my Helm chart values file). It looks like the gitlab-runner-helper images are not built with docker buildx, and the default one does not support arm64.

If the issue is confirmed, a short-term mitigation would be to update the documentation to mention that non-x86_64 architectures require a different helper image, while a longer-term fix would be to build multi-arch gitlab-runner-helper images with docker buildx (the gitlab-runner image is already built this way).
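
As a rough sketch of the workaround, the helper image can be derived from the node architecture when generating per-architecture values files. The `arm64-latest` tag is the one used in this report. The `x86_64` prefix for `amd64` nodes is an assumption about the helper image tag naming, and `helper_image` is a hypothetical helper, not part of gitlab-runner.

```python
# Maps the value of the kubernetes.io/arch node label to the tag prefix of
# the gitlab-runner-helper image. "arm64" comes from this report's config;
# the "x86_64" prefix is an assumption about the tag naming.
HELPER_ARCH = {
    "amd64": "x86_64",
    "arm64": "arm64",
}


def helper_image(node_arch: str, version: str = "latest") -> str:
    """Build the helper_image value for a given kubernetes.io/arch label."""
    try:
        prefix = HELPER_ARCH[node_arch]
    except KeyError:
        raise ValueError(
            "no known helper image for architecture {!r}".format(node_arch)
        ) from None
    return "gitlab/gitlab-runner-helper:{}-{}".format(prefix, version)


print(helper_image("arm64"))  # gitlab/gitlab-runner-helper:arm64-latest
```

The returned string is what goes into `helper_image` under `[runners.kubernetes]`, next to a node selector for the same architecture.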