GitLab Functions (steps) fail to run using version 18.7.0-pre

Summary

GitLab Functions (formerly Steps) fail to execute on GitLab Runner version 18.7.0-pre. The issue is not present in version 18.6.6.

Steps to Reproduce

Create a step-runner job with the following configuration:

changelog-format:
  stage: test
  image: registry.gitlab.com/gitlab-org/step-runner:v0
  run:
    - name: changelog
      step: ./.gitlab/steps/changelog
      inputs:
        changelog: "${{job.CI_PROJECT_DIR}}/CHANGELOG.md"
        echo_latest: true

Observed Behavior

Jobs time out after one hour. Until then, the Runner sends an empty trace patch every minute.

Expected Behavior

Jobs should complete successfully, as they do on deployments running earlier Runner versions.

Analysis

Scope of Impact

  • Issue affects only jobs using GitLab Functions
  • Standard image pulls earlier in the job succeed
  • Problem appears to occur after the image is pulled from the registry

Recent Changes

Two changes were recently made to the way Runner executes steps:

  1. gitlab-runner-helper now starts the functions gRPC server for "native" function jobs (bypassing the shim)
  2. A new Connect() method provides a generic way for executors to obtain a gRPC server connection

Log Evidence

Log analysis shows the job is received and started, followed by an hour of empty trace patches until timeout.
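
The pattern described above (job start followed only by empty trace patches) could be flagged automatically long before the one-hour timeout. The following is a minimal sketch, not Runner code: the `TracePatch` record, its field names, and the five-minute threshold are all hypothetical and would need to be mapped onto whatever the real logs expose.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class TracePatch:
    """Hypothetical record for one trace patch seen in the logs."""
    sent_at: datetime
    size_bytes: int  # 0 means the patch carried no job output


def stalled_since(patches: Iterable[TracePatch]) -> Optional[datetime]:
    """Return when the trailing run of empty patches began.

    Returns None if there are no patches or the latest patch had output.
    """
    start = None
    for patch in sorted(patches, key=lambda p: p.sent_at):
        if patch.size_bytes == 0:
            if start is None:
                start = patch.sent_at  # first empty patch of a new run
        else:
            start = None  # real output resets the run
    return start


def is_hung(
    patches: Iterable[TracePatch],
    now: datetime,
    threshold: timedelta = timedelta(minutes=5),  # assumed value
) -> bool:
    """True if only empty patches have been sent for at least `threshold`."""
    start = stalled_since(patches)
    return start is not None and now - start >= threshold
```

In the failure reported here, the run of empty patches would begin right after the job starts, so a check like this would trip within minutes rather than after an hour.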

Metrics

  • Private runners show no increased error rate since the version change (consistent with Functions-only impact)
  • Spike in timeouts observed Friday ~12:00 UTC
  • Timeouts also occur on shared-gitlab-org runners (which were not updated), suggesting potential server-side factors

Environment

| Component | Version | Status |
|---|---|---|
| Private runners (updated Thu evening UTC) | 18.7.0~pre.390.g6d7a049f (6d7a049f) | ❌ Failing |
| Previous working version | 18.4.0~pre.246.g71914659 (71914659) | ✅ Working using docker executor |
| Manually deployed EC2 Runner | 18.6.6 (df85dadf) | ✅ Working using docker executor |

Recommended Actions

  1. Immediate: Investigate the root cause, focusing on the gRPC connection lifecycle
  2. Short-term: Add enhanced logging around the functions gRPC server connection and step execution
  3. Long-term: Implement monitoring for Functions-specific execution paths
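
For the first two actions, one possible shape is to put a deadline and log lines around the connection attempt, so that a hang surfaces as an explicit error instead of an hour of silence. The sketch below is a language-neutral illustration in Python, not the Runner's implementation: `connect_with_deadline`, `ConnectTimeout`, the logger name, and the `connect` callable are all hypothetical stand-ins for whatever obtains the functions gRPC server connection.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("functions.connect")  # hypothetical logger name


class ConnectTimeout(Exception):
    """Raised when the connection attempt exceeds its deadline."""


def connect_with_deadline(connect: Callable[[], T], timeout_s: float) -> T:
    """Run `connect` with a deadline, logging start, outcome, and duration."""
    started = time.monotonic()
    log.info("connecting to functions server (deadline %.1fs)", timeout_s)

    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(connect)
    done, _ = wait([future], timeout=timeout_s)
    # Do not block on a hung attempt; the worker thread is left to finish.
    pool.shutdown(wait=False)

    elapsed = time.monotonic() - started
    if not done:
        log.error("connect still pending after %.2fs, giving up", elapsed)
        raise ConnectTimeout(f"no connection after {timeout_s}s")

    try:
        conn = future.result()  # re-raises any error from `connect`
    except Exception:
        log.exception("connect failed after %.2fs", elapsed)
        raise

    log.info("connected after %.2fs", elapsed)
    return conn
```

The choice of deadline is a judgment call: it should be long enough for a slow but healthy start, and far shorter than the job timeout.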

⚠️ Urgency

A fix must be deployed by 2025-12-18 due to the Hard Production Change Lock.

Edited Dec 15, 2025 by Cameron Swords