Enable GitLab Functions on instance executor

Summary

Add native steps (GitLab Functions) support to the instance executor, mirroring the docker executor's integration in executors/docker/steps.go (bootstrap, ready marker, serve, proxy). The instance executor already exposes Run / DialRun on executors.Client (see executors/environment.go:17-29 and executors/internal/autoscaler/acquisition.go:298-318), so the connector can be implemented on top of these without a new transport.

Lifecycle decision: per-job, "No smarts"

step-runner is multi-tenant capable, but the simplest first cut is to spawn one step-runner serve per job and tie its lifetime to the job. That mirrors how script: already behaves and keeps us off the long-lived/process-survives-disconnect path that's awkward over SSH/WinRM.

  • One gitlab-runner steps serve process per build.
  • Per-build private unix socket (no shared/well-known socket).
  • The serve process is torn down when the job reaches a final state — success, failure, graceful cancel, or hard abort all converge on the same cleanup. The connector should not try to distinguish between these; cleanup is driven by Close() on the returned io.ReadWriteCloser plus the executor's Cleanup(), both of which run regardless of how the build ended.
  • Resumability across reconnects/runner restarts is not supported in this iteration.

A daemon mode (well-known socket, lifecycle owned by systemd, opt-in for users who want resumable jobs across network glitches/runner crashes) is explicitly deferred to a follow-up. Per-job and daemon mode can coexist later: daemon mode would have a well-known socket path, and the connector can try that first before falling back to spawning per-job.

Feature Flag

This feature will be gated behind a new feature flag: FF_FUNCTION_MIGRATIONS_ON_INSTANCE_EXECUTOR (added to helpers/featureflags/flags.go alongside UseScriptToStepMigration and UseConcrete).

Note: FF_SCRIPT_TO_STEP_MIGRATION can be enabled independently of FF_FUNCTION_MIGRATIONS_ON_INSTANCE_EXECUTOR. This allows users to migrate scripts to steps without enabling instance executor functions support, which may be useful for testing or gradual rollout scenarios.

Important: The instance executor must NOT check this feature flag when a job uses the run: syntax. Since GitLab Functions is experimental, gating run: usage behind a feature flag is not required. The feature flag only gates the migration path from script: to steps. The check in common/steps.go:UseNativeSteps already keys off Job.Run length OR the migration flags, so the gating belongs at the executor-feature level (see "Enabling" below), not in UseNativeSteps.

Scope / Tasks

1. Enabling

  • In executors/instance/instance.go, the featuresUpdater currently sets only Variables and Shared. Add features.NativeStepsIntegration = true, conditional on:
    • The runner is non-Windows (UseNativeSteps already guards this, but we should not advertise the feature on Windows instance providers).
    • FF_FUNCTION_MIGRATIONS_ON_INSTANCE_EXECUTOR is on OR the job uses run: (i.e. don't gate run: behind the flag — see note above).
  • Mirror the docker pattern of having a commandExecutor wrapper if needed, or add Connect directly on the instance executor type, with var _ steps.Connector = (*executor)(nil) to enforce the interface at compile time.
  • Compile-time interface assertion lives next to similar ones in executors/internal/autoscaler/executor.go:17 and executors/docker/machine/machine.go:19.
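The gating rule above can be sketched as a small predicate. This is only an illustration: jobInfo and its fields are hypothetical stand-ins for whatever the featuresUpdater can actually see, not types from the runner codebase.

```go
package main

import "fmt"

// jobInfo is a hypothetical stand-in for the job and runner state that the
// gating decision needs. None of these fields are real runner types.
type jobInfo struct {
	RemoteOS      string // assumed: "linux", "windows", ...
	RunSteps      int    // assumed: number of entries under run:
	MigrationFlag bool   // FF_FUNCTION_MIGRATIONS_ON_INSTANCE_EXECUTOR
}

// advertiseNativeSteps reports whether NativeStepsIntegration should be set.
// Windows is never advertised; run: jobs are not gated behind the flag.
func advertiseNativeSteps(j jobInfo) bool {
	if j.RemoteOS == "windows" {
		return false
	}
	return j.RunSteps > 0 || j.MigrationFlag
}

func main() {
	cases := []jobInfo{
		{RemoteOS: "linux", RunSteps: 1},
		{RemoteOS: "linux", MigrationFlag: true},
		{RemoteOS: "linux"},
		{RemoteOS: "windows", RunSteps: 1, MigrationFlag: true},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %v\n", c, advertiseNativeSteps(c))
	}
}
```

Keeping the decision in one pure function makes the "run: is never gated" rule easy to cover with a table-driven unit test.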

2. Implement steps.Connector (Connect(ctx) (func() (io.ReadWriteCloser, error), error))

Mirror executors/docker/steps.go:Connect adapted to the instance transport:

  • Stream stderr from the spawned serve process through executors/internal/readywriter (readywriter.New(ctx, stderr)) to capture the "step-runner is listening on socket <path>" marker emitted by commands/steps/steps.go:readyMessage.
  • Spawn gitlab-runner steps serve --socket <sockPath> on the instance via client.DialRun(ctx, cmd). Using DialRun (not Run) is important:
    • DialRun returns a net.Conn whose lifetime is bound to the remote process — closing the conn terminates the serve process, which gives us free cleanup whenever the build is torn down.
    • It bypasses the BuildShell wrapper and shell-script-on-stdin path that Run uses, which is exactly how we keep job env vars out of the serve process's environment (see "Environment isolation" below).
    • Docker uses the same trick to keep docker system dial-stdio alive past job-context cancellation in executors/docker/docker.go:1335-1348.
  • Read stdout/stderr from the returned conn into the build log; tee stderr through the readywriter so the ready marker can be detected.
  • Block until either:
    • the ready channel yields a socket path → continue;
    • the conn returns EOF / serve exits early → return error wrapped as BuildError with the exit code (cf. connector.ExitError handling at executors/internal/autoscaler/acquisition.go:308-315);
    • ctx.Done() → return ctx.Err().
  • Return a closure that, on each invocation, calls client.DialRun(ctx, "gitlab-runner steps proxy --socket <sockPath>") and wraps the resulting net.Conn as the io.ReadWriteCloser consumed by steps.Execute (called from common/build.go:478-518). Close() on this wrapper must close the proxy conn AND close the serve conn so the serve process terminates.

gitlab-runner steps serve and gitlab-runner steps proxy are already implemented at commands/steps/steps.go:226-285. The gitlab-runner binary is assumed on the instance's PATH — main.go:155 registers the provider with "gitlab-runner" as runnerCommandPath, matching the existing assumption for script: execution.

3. Per-build socket isolation

The instance executor allows multiple concurrent jobs per VM (CapacityPerInstance can be >1 — see integration test at executors/instance/instance_integration_test.go:78), so the socket path must include the build ID to avoid collisions:

  • Default: <remote-tmp-dir>/gitlab-runner-steps/<buildID>/step-runner.sock.
    • Use the OS temp dir on the remote (/tmp on Linux). For now, hard-code per-OS; we already select per-OS shells.
    • <buildID> = Build.ID (job ID, unique within a runner).
  • step-runner serve creates the socket; we just need to pass --socket <path> and ensure the parent directory exists. The serve command in commands/steps/steps.go:Serve opens the listener on the path we pass.
  • Cleanup: best-effort rm -rf <dir> after the conn closes. Acceptable to leak on hard crashes — instance lifecycle (max-use, idle-time) will reap eventually.
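A minimal sketch of the path construction, assuming the build ID is an int64 and the remote temp dir is passed in by the caller. socketPath and socketDir are hypothetical helper names. The path package is used rather than path/filepath because the path is interpreted on the remote Linux instance, not on the host running the runner manager.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
)

// socketPath builds <remote-tmp-dir>/gitlab-runner-steps/<buildID>/step-runner.sock.
// It always uses forward slashes, since the path is consumed on the remote.
func socketPath(remoteTmp string, buildID int64) string {
	return path.Join(remoteTmp, "gitlab-runner-steps", strconv.FormatInt(buildID, 10), "step-runner.sock")
}

// socketDir returns the per-build directory that must exist before serve
// starts and that the best-effort cleanup removes afterwards.
func socketDir(sock string) string {
	return path.Dir(sock)
}

func main() {
	sock := socketPath("/tmp", 42)
	fmt.Println(sock)
	fmt.Println(socketDir(sock))
}
```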

4. Environment isolation ("no env" run path)

Today, executor.Run (executors/instance/instance.go:75-88) uses e.BuildShell.CmdLine (typically bash or bash -l) and pipes a generated shell script via stdin. The script exports every job variable before running anything (shells/bash.go / shells/abstract.go). Inheriting that into step-runner serve's environment defeats the point — the server would see all job vars and could leak them into step processes.

DialRun already avoids this: it sends just the command, with no shell-script prelude. As long as we invoke serve via DialRun(ctx, "gitlab-runner steps serve --socket …"), the spawned process inherits only the SSH/WinRM session's ambient env (typically just PATH, HOME, USER).

Action items:

  • Confirm via tests that the serve-side process env contains no CI_* / job-level vars (see test list below).
  • If a future need arises to also strip ambient SSH-session vars, we can add an explicit env -i prefix on the remote command. Not required for V1.
  • Do not introduce a separate "no env" RunOptions field — the choice between shell-wrapped (Run) and direct exec (DialRun) already covers both modes.
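One way to express the first action item in a test is a helper that scans an environment listing for job-level variables. The sketch below is an assumption about how such an assertion could look: leakedVars is a hypothetical name, and how the serve-side environment is captured is left open.

```go
package main

import (
	"fmt"
	"strings"
)

// leakedVars returns the names of variables in env (KEY=VALUE entries) that
// look like job variables: anything prefixed CI_ or listed in jobVarNames.
func leakedVars(env []string, jobVarNames map[string]bool) []string {
	var leaked []string
	for _, kv := range env {
		name, _, _ := strings.Cut(kv, "=")
		if strings.HasPrefix(name, "CI_") || jobVarNames[name] {
			leaked = append(leaked, name)
		}
	}
	return leaked
}

func main() {
	// Ambient session env only: nothing should be reported.
	ambient := []string{"PATH=/usr/bin", "HOME=/home/runner", "USER=runner"}
	fmt.Println(leakedVars(ambient, map[string]bool{"DEPLOY_TOKEN": true}))

	// A shell-wrapped env would carry job variables and be flagged.
	wrapped := append(ambient, "CI_JOB_ID=42", "DEPLOY_TOKEN=secret")
	fmt.Println(leakedVars(wrapped, map[string]bool{"DEPLOY_TOKEN": true}))
}
```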

5. Logging

  • stderr from the serve conn → BuildLogger.Stream(StreamWorkLevel, Stderr) (after teeing through readywriter).
  • stdout from the serve conn → BuildLogger.Stream(StreamWorkLevel, Stdout). step-runner serve doesn't normally write to stdout, but capture it for diagnostics.
  • The proxy conn is gRPC-framed; do not inject it into the build log. Use executors/docker/internal/omitwriter (already used by docker steps proxy at executors/docker/steps.go:152) or an equivalent for stderr suppression on the proxy side.

6. Tests

  • Unit tests for the Connect flow:
    • ready marker detected → returns dialer closure;
    • serve exits non-zero before ready → returns BuildError with normalized exit code;
    • context cancel before ready → returns ctx.Err();
    • dialer closure invokes client.DialRun with the right gitlab-runner steps proxy --socket … command.
  • Socket-isolation test: two concurrent Connect calls on the same instance produce different socket paths and don't collide.
  • Environment-isolation test: assert that the command passed to DialRun is gitlab-runner steps serve --socket … with no shell wrapping, and that no CI_* variable appears in the spawned process's env (assert at the mocked client.DialRun boundary).
  • Final-state cleanup test: regardless of how the build ends (success / script failure / graceful cancel via context / hard executor abort via Cleanup()), the serve conn is closed and the per-build socket directory is removed. Mocks should assert close-ordering: proxy conn closes before serve conn.
  • Integration test (Linux + bash) running a minimal run: job through the existing instance_integration_test.go harness — verifies end-to-end with the real gitlab-runner steps serve/proxy against the SSH stub server.
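The close ordering that the final-state cleanup test asserts (proxy first, then serve) could be enforced by a small wrapper type. This is a sketch under assumptions: stepsConn and recorder are hypothetical names, errors.Join requires Go 1.20 or later, and removal of the socket directory is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// stepsConn exposes the proxy conn to the caller and owns the serve conn.
type stepsConn struct {
	io.ReadWriteCloser           // proxy conn, used for reads and writes
	serve              io.Closer // serve conn, closed after the proxy
}

// Close closes the proxy conn first and the serve conn second. The serve
// conn is closed even if closing the proxy fails.
func (c *stepsConn) Close() error {
	proxyErr := c.ReadWriteCloser.Close()
	serveErr := c.serve.Close()
	return errors.Join(proxyErr, serveErr)
}

// recorder is a fake conn that records the order in which Close is called.
type recorder struct {
	name string
	log  *[]string
}

func (r recorder) Read(p []byte) (int, error)  { return 0, io.EOF }
func (r recorder) Write(p []byte) (int, error) { return len(p), nil }
func (r recorder) Close() error {
	*r.log = append(*r.log, r.name)
	return nil
}

func main() {
	var order []string
	conn := &stepsConn{
		ReadWriteCloser: recorder{name: "proxy", log: &order},
		serve:           recorder{name: "serve", log: &order},
	}
	err := conn.Close()
	fmt.Println(order, err)
}
```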

Implementation Notes

  • File layout: add executors/instance/steps.go (new) for Connect + helpers; minimal edits to executors/instance/instance.go to add the steps.Connector interface assertion and update featuresUpdater.
  • Reuse executors/internal/readywriter and the docker-style omitwriter rather than reinventing.
  • The connector wires into the rest of the runner via common/build.go:573 (executor.(steps.Connector)); no changes needed in common/.
  • gitlab-runner-helper is not required on the instance for this feature — unlike docker, where the helper is bootstrapped into a volume, the instance executor calls the full gitlab-runner binary that is already expected to be installed on the VM.

Acceptance Criteria

  • Instance executor advertises NativeStepsIntegration and runs steps for jobs using the run: keyword without any feature flag.
  • FF_FUNCTION_MIGRATIONS_ON_INSTANCE_EXECUTOR enables the script: → steps migration path on the instance executor.
  • Steps proxy connects successfully through the instance transport.
  • No job variables are present in the OS environment of the spawned step-runner serve process.
  • Socket paths are isolated per build (build-ID-scoped directory under remote tmp).
  • When the job reaches a final state — success, failure, graceful cancel, or hard abort — the serve process and its socket are cleaned up. No leaked processes or sockets after the build ends.
  • Daemon-mode is not delivered in this iteration; tracked as a follow-up.
  • Tests pass (unit + integration).
Edited by Arran Walker