Enable GitLab Functions on instance executor
Summary
Add native steps (GitLab Functions) support to the instance executor, mirroring the docker executor's integration in executors/docker/steps.go (bootstrap, ready marker, serve, proxy). The instance executor already exposes Run / DialRun on executors.Client (see executors/environment.go:17-29 and executors/internal/autoscaler/acquisition.go:298-318), so the connector can be implemented on top of these without a new transport.
Lifecycle decision: per-job, "No smarts"
step-runner is multi-tenant capable, but the simplest first cut is to spawn one step-runner serve per job and tie its lifetime to the job. That mirrors how script: already behaves and keeps us off the long-lived/process-survives-disconnect path that's awkward over SSH/WinRM.
- One gitlab-runner steps serve process per build.
- Per-build private unix socket (no shared/well-known socket).
- The serve process is torn down when the job reaches a final state: success, failure, graceful cancel, or hard abort all converge on the same cleanup. The connector should not try to distinguish between these; cleanup is driven by Close() on the returned io.ReadWriteCloser plus the executor's Cleanup(), both of which run regardless of how the build ended.
- Resumability across reconnects/runner restarts is not supported in this iteration.
A daemon mode (well-known socket, lifecycle owned by systemd, opt-in for users who want resumable jobs across network glitches/runner crashes) is explicitly deferred to a follow-up. Per-job mode and daemon mode can coexist later: daemon mode would have a well-known socket path, and the connector can try that first before falling back to spawning a per-job serve process.
Feature Flag
This feature will be gated behind a new feature flag: FF_FUNCTION_MIGRATIONS_ON_INSTANCE_EXECUTOR (added to helpers/featureflags/flags.go alongside UseScriptToStepMigration and UseConcrete).
Note: FF_SCRIPT_TO_STEP_MIGRATION can be enabled independently of FF_FUNCTION_MIGRATIONS_ON_INSTANCE_EXECUTOR. This allows users to migrate scripts to steps without enabling instance executor functions support, which may be useful for testing or gradual rollout scenarios.
Important: The instance executor must NOT check this feature flag when a job uses the run: syntax. Since GitLab Functions is experimental, gating run: usage behind a feature flag is not required. The feature flag only gates the migration path from script: to steps. The check in common/steps.go:UseNativeSteps already keys off Job.Run length OR the migration flags, so the gating belongs at the executor-feature level (see "Enabling" below), not in UseNativeSteps.
Scope / Tasks
1. Enabling
- In executors/instance/instance.go, the featuresUpdater currently sets only Variables and Shared. Add features.NativeStepsIntegration = true, conditional on:
  - The runner is non-Windows (UseNativeSteps already guards this, but we should not advertise the feature on Windows instance providers).
  - FF_FUNCTION_MIGRATIONS_ON_INSTANCE_EXECUTOR is on OR the job uses run: (i.e. don't gate run: behind the flag; see note above).
- Mirror the docker pattern of having a commandExecutor wrapper if needed, or add Connect directly on the instance executor type and var _ steps.Connector = (*executor)(nil) to enforce the interface.
- Compile-time interface assertion lives next to similar ones in executors/internal/autoscaler/executor.go:17 and executors/docker/machine/machine.go:19.
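The gating conditions above can be sketched as a single predicate. This is only an illustration: the jobInfo type and its field names are hypothetical and do not come from the runner code base; the real check would read the executor's shell/OS information, the feature flag, and Job.Run.

```go
package main

// jobInfo is a hypothetical, trimmed-down view of the state the gating
// decision needs. None of these names exist in gitlab-runner.
type jobInfo struct {
	RemoteOSIsWindows bool // true when the instance provider runs Windows
	MigrationFlagOn   bool // FF_FUNCTION_MIGRATIONS_ON_INSTANCE_EXECUTOR
	RunStepCount      int  // number of entries under the job's run: keyword
}

// advertiseNativeSteps reports whether the executor should advertise
// NativeStepsIntegration for this job: never on Windows, otherwise when
// the migration flag is on OR the job uses run:.
func advertiseNativeSteps(j jobInfo) bool {
	if j.RemoteOSIsWindows {
		return false
	}
	return j.MigrationFlagOn || j.RunStepCount > 0
}
```

Keeping the Windows check first makes it explicit that neither the flag nor run: can enable the feature on a Windows instance provider.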
2. Implement steps.Connector (Connect(ctx) (func() (io.ReadWriteCloser, error), error))
Mirror executors/docker/steps.go:Connect adapted to the instance transport:
- Stream stderr from the spawned serve process through executors/internal/readywriter (readywriter.New(ctx, stderr)) to capture the "step-runner is listening on socket <path>" marker emitted by commands/steps/steps.go:readyMessage.
- Spawn gitlab-runner steps serve --socket <sockPath> on the instance via client.DialRun(ctx, cmd). Using DialRun (not Run) is important:
  - DialRun returns a net.Conn whose lifetime is bound to the remote process; closing the conn terminates the serve process, which gives us free cleanup whenever the build is torn down.
  - It bypasses the BuildShell wrapper and shell-script-on-stdin path that Run uses, which is exactly how we keep job env vars out of the serve process's environment (see "Environment isolation" below).
  - Docker uses the same trick to keep docker system dial-stdio alive past job-context cancellation in executors/docker/docker.go:1335-1348.
- Read stdout/stderr from the returned conn into the build log; tee stderr through the readywriter so the ready marker can be detected.
- Block until either:
  - the ready channel yields a socket path → continue;
  - the conn returns EOF / serve exits early → return error wrapped as BuildError with the exit code (cf. connector.ExitError handling at executors/internal/autoscaler/acquisition.go:308-315);
  - ctx.Done() → return ctx.Err().
- Return a closure that, on each invocation, calls client.DialRun(ctx, "gitlab-runner steps proxy --socket <sockPath>") and wraps the resulting net.Conn as the io.ReadWriteCloser consumed by steps.Execute (called from common/build.go:478-518). Close() on this wrapper must close the proxy conn AND close the serve conn so the serve process terminates.
gitlab-runner steps serve and gitlab-runner steps proxy are already implemented at commands/steps/steps.go:226-285. The gitlab-runner binary is assumed on the instance's PATH — main.go:155 registers the provider with "gitlab-runner" as runnerCommandPath, matching the existing assumption for script: execution.
3. Per-build socket isolation
The instance executor allows multiple concurrent jobs per VM (CapacityPerInstance can be >1 — see integration test at executors/instance/instance_integration_test.go:78), so the socket path must include the build ID to avoid collisions:
- Default: <remote-tmp-dir>/gitlab-runner-steps/<buildID>/step-runner.sock.
  - Use the OS temp dir on the remote (/tmp on Linux). For now, hard-code per-OS; we already select per-OS shells.
  - <buildID> = Build.ID (job ID, unique within a runner).
- step-runner serve creates the socket; we just need to pass --socket <path> and ensure the parent directory exists. The serve command in commands/steps/steps.go:Serve opens the listener on the path we pass.
- Cleanup: best-effort rm -rf <dir> after the conn closes. Acceptable to leak on hard crashes; instance lifecycle (max-use, idle-time) will reap eventually.
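A minimal sketch of the path construction, assuming only the Linux temp dir named above. The function names remoteTmpDir and socketPaths are hypothetical, and other operating systems are deliberately returned as an error rather than guessed.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
)

// remoteTmpDir returns the hard-coded temp dir for the remote OS. Only the
// Linux value is given in this issue; anything else is an error here.
func remoteTmpDir(remoteOS string) (string, error) {
	switch remoteOS {
	case "linux":
		return "/tmp", nil
	default:
		return "", fmt.Errorf("no steps socket dir configured for OS %q", remoteOS)
	}
}

// socketPaths returns the per-build directory and the socket path inside it.
// The path package (not path/filepath) is used on purpose: the result is
// interpreted on the remote host, not on the machine running the runner.
func socketPaths(remoteOS string, buildID int64) (dir, sock string, err error) {
	tmp, err := remoteTmpDir(remoteOS)
	if err != nil {
		return "", "", err
	}
	dir = path.Join(tmp, "gitlab-runner-steps", strconv.FormatInt(buildID, 10))
	return dir, path.Join(dir, "step-runner.sock"), nil
}
```

Returning the directory alongside the socket path gives the cleanup step the exact argument for its best-effort rm -rf, and because the directory name is derived from a numeric build ID it needs no shell quoting.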
4. Environment isolation ("no env" run path)
Today, executor.Run (executors/instance/instance.go:75-88) uses e.BuildShell.CmdLine (typically bash or bash -l) and pipes a generated shell script via stdin. The script exports every job variable before running anything (shells/bash.go / shells/abstract.go). Inheriting that into step-runner serve's environment defeats the point — the server would see all job vars and could leak them into step processes.
DialRun already avoids this: it sends just the command, with no shell-script prelude. As long as we invoke serve via DialRun(ctx, "gitlab-runner steps serve --socket …"), the spawned process inherits only the SSH/WinRM session's ambient env (typically just PATH, HOME, USER).
Action items:
- Confirm via tests that the serve-side process env contains no CI_* / job-level vars (see test list below).
- If a future need arises to also strip ambient SSH-session vars, we can add an explicit env -i prefix on the remote command. Not required for V1.
- Do not introduce a separate "no env" RunOptions field; the choice between shell-wrapped (Run) and direct exec (DialRun) already covers both modes.
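For the first action item, the test-side checks could be as small as the helpers below. This is a sketch under assumptions: isDirectServeCommand and leakedJobVars are hypothetical names, and how the test obtains the serve process's environment (KEY=VALUE entries) is left open.

```go
package main

import "strings"

// servePrefix is the direct-exec command from this issue, without any
// shell wrapper such as "bash -c" in front of it.
const servePrefix = "gitlab-runner steps serve --socket "

// isDirectServeCommand reports whether cmd invokes serve directly.
func isDirectServeCommand(cmd string) bool {
	return strings.HasPrefix(cmd, servePrefix)
}

// leakedJobVars returns the names of env entries (KEY=VALUE) that are
// either CI_* variables or listed in jobVars.
func leakedJobVars(env []string, jobVars map[string]bool) []string {
	var leaked []string
	for _, kv := range env {
		name, _, _ := strings.Cut(kv, "=")
		if jobVars[name] || strings.HasPrefix(name, "CI_") {
			leaked = append(leaked, name)
		}
	}
	return leaked
}
```

A test would then assert that isDirectServeCommand holds for the command seen at the mocked DialRun boundary and that leakedJobVars returns an empty result for the captured environment.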
5. Logging
- stderr from the serve conn → BuildLogger.Stream(StreamWorkLevel, Stderr) (after teeing through readywriter).
- stdout from the serve conn → BuildLogger.Stream(StreamWorkLevel, Stdout); step-runner serve doesn't normally write to stdout, but capture it for diagnostics.
- The proxy conn is gRPC-framed; do not inject it into the build log. Use executors/docker/internal/omitwriter (already used by docker steps proxy at executors/docker/steps.go:152) or an equivalent for stderr suppression on the proxy side.
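The stream wiring can be illustrated with standard-library writers. This sketch does not use the real readywriter or omitwriter APIs, which are not shown in this issue: teeStderr stands in for the serve-side tee, and tailWriter is a hypothetical "equivalent" that suppresses proxy stderr while keeping its last bytes for error reporting.

```go
package main

import "io"

// teeStderr returns a writer that forwards serve stderr to the build log
// and to the ready-marker detector. io.MultiWriter writes to each sink in
// order and stops at the first error.
func teeStderr(buildLog, readyDetector io.Writer) io.Writer {
	return io.MultiWriter(buildLog, readyDetector)
}

// tailWriter keeps only the last max bytes written to it and drops the
// rest, so proxy stderr never reaches the build log but is still
// available if the proxy fails.
type tailWriter struct {
	max  int
	tail []byte
}

// Write always reports the full length as written, even when older bytes
// are discarded.
func (w *tailWriter) Write(p []byte) (int, error) {
	w.tail = append(w.tail, p...)
	if over := len(w.tail) - w.max; over > 0 {
		w.tail = w.tail[over:]
	}
	return len(p), nil
}

// String returns the retained tail.
func (w *tailWriter) String() string { return string(w.tail) }
```

One consequence of using io.MultiWriter is that a failing ready detector would also stop log output, so the detector's Write should not return errors in normal operation.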
6. Tests
- Unit tests for the Connect flow:
  - ready marker detected → returns dialer closure;
  - serve exits non-zero before ready → returns BuildError with normalized exit code;
  - context cancel before ready → returns ctx.Err();
  - dialer closure invokes client.DialRun with the right gitlab-runner steps proxy --socket … command.
- Socket-isolation test: two concurrent Connect calls on the same instance produce different socket paths and don't collide.
- Environment-isolation test: assert that the command passed to DialRun is gitlab-runner steps serve --socket … with no shell wrapping, and that no CI_* variable appears in the spawned process's env (assert at the mocked client.DialRun boundary).
- Final-state cleanup test: regardless of how the build ends (success / script failure / graceful cancel via context / hard executor abort via Cleanup()), the serve conn is closed and the per-build socket directory is removed. Mocks should assert close-ordering: proxy conn closes before serve conn.
- Integration test (Linux + bash) running a minimal run: job through the existing instance_integration_test.go harness; verifies end-to-end with the real gitlab-runner steps serve/proxy against the SSH stub server.
Implementation Notes
- File layout: add executors/instance/steps.go (new) for Connect + helpers; minimal edits to executors/instance/instance.go to add the steps.Connector interface assertion and update featuresUpdater.
- Reuse executors/internal/readywriter and the docker-style omitwriter rather than reinventing.
- The connector wires into the rest of the runner via common/build.go:573 (executor.(steps.Connector)); no changes needed in common/.
- gitlab-runner-helper is not required on the instance for this feature. Unlike docker, where the helper is bootstrapped into a volume, the instance executor calls the full gitlab-runner binary that is already expected to be installed on the VM.
Acceptance Criteria
- Instance executor advertises NativeStepsIntegration and runs steps for jobs using the run: keyword without any feature flag.
- FF_FUNCTION_MIGRATIONS_ON_INSTANCE_EXECUTOR enables the script: → steps migration path on the instance executor.
- Steps proxy connects successfully through the instance transport.
- No job variables are present in the OS environment of the spawned step-runner serve process.
- Socket paths are isolated per build (build-ID-scoped directory under remote tmp).
- When the job reaches a final state (success, failure, graceful cancel, or hard abort), the serve process and its socket are cleaned up. No leaked processes or sockets after the build ends.
- Daemon-mode is not delivered in this iteration; tracked as a follow-up.
- Tests pass (unit + integration).