feat: add webserver-to-indexer process health relay
What does this MR do and why?
Implements the webserver-to-indexer process health relay described in gitlab#593552.
The indexer sends a heartbeat to Rails every ~10s. Until now, Rails had visibility into the indexer's health but not the webserver's. This MR adds a relay so the indexer can include the webserver's health data inside its own heartbeat payload.
Architecture
Webserver ──POST ZOEKT_INDEXER_INTERNAL_URL/indexer/internal/process_health──▶ Indexer ──heartbeat──▶ Rails

The webserver pushes its metrics to the indexer over the loopback interface; the indexer keeps the latest report in memory and folds it into every heartbeat, omitting the webserver block when the last report is older than 30 seconds. The absence itself is the failure signal for Rails.
How it works
New internal/shared/processhealth package:
- ProcessMetrics: wire payload (mmap_current, mmap_max, restarts_1m/5m/15m, rss_bytes, uptime_seconds, plus shards_loaded on the webserver side via omitempty).
- WebserverReport: ProcessMetrics + ReceivedAt.
- WebserverStore: lock-free atomic.Pointer[WebserverReport]; the writer is the HTTP handler, the reader is the heartbeat builder.
- CollectMetrics(tracker): reads process_resident_memory_bytes, process_start_time_seconds, and the mmap gauges from the Prometheus default gatherer, and restart counts from internal/process_starts. A nil tracker yields zero restart counts.
- CollectWebserverMetrics(tracker): same, plus zoekt_shards_loaded (always populated, even when zero, so the field is always present for the webserver).
- Handler(store): http.HandlerFunc that decodes a ProcessMetrics POST body and stores it with the receive timestamp. Returns 405 for non-POST, 400 for malformed bodies. Auth is enforced by the indexer's existing JWT middleware on the main router, not by the handler itself.
- StartPushLoop(ctx, url, interval, tracker, auth): webserver push loop; returns immediately when url is empty (graceful opt-out). When auth is non-nil, every push carries a short-lived JWT in Gitlab-Zoekt-Api-Request: Bearer <token> so the indexer accepts it like any other request.
- Constants: EndpointPath = "/indexer/internal/process_health", DefaultIndexerInternalURL = "http://localhost:6065", OnlineThreshold = 30s, PushInterval = 10s.
Indexer side (internal/mode/indexer/indexer.go, internal/server/):
- Creates a WebserverStore at startup and wires it into IndexServer.ProcessHealthHandler (new http.HandlerFunc field on IndexServer).
- The new route POST <PathPrefix>/internal/process_health is registered on the existing indexer router (no separate listener, no new port). The chart's indexer.listen.port: 6065 continues to be the only port the indexer exposes.
- The route is protected by the same JWT middleware that guards every other /indexer/... route. Both processes already share the secret token, so this keeps the new endpoint consistent with the rest of the auth surface and removes any reliance on network topology for security.
- task_request now receives StartTracker and WebserverStore so the heartbeat builder can call CollectMetrics(...) and read the latest webserver report.
Webserver side (internal/mode/webserver/webserver.go):
- After loading the shared secret and constructing its authentication.Auth, the webserver reads ZOEKT_INDEXER_INTERNAL_URL and launches processhealth.StartPushLoop with that Auth instance.
  - Unset → defaults to http://localhost:6065 (matches the Helm chart, zero-config in production).
  - Empty (ZOEKT_INDEXER_INTERNAL_URL=) → skips reporting entirely.
  - URL → pushes to that URL every PushInterval, signing each request with a short-lived JWT.
- Cancelled cleanly via context on shutdown.
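The three-way handling of ZOEKT_INDEXER_INTERNAL_URL depends on telling "unset" apart from "set to empty", which os.Getenv cannot do but os.LookupEnv can. A rough sketch under that assumption follows; the function names and the injected lookup parameter are hypothetical and chosen only to make the logic easy to exercise.

```go
package main

import "os"

// defaultIndexerInternalURL matches the default named in the description.
const defaultIndexerInternalURL = "http://localhost:6065"

// resolvePushURL maps the environment variable to a push target.
// lookup has the same signature as os.LookupEnv so tests can fake it.
//   - unset      -> default URL, reporting enabled
//   - set, empty -> reporting disabled
//   - set, value -> that URL, reporting enabled
func resolvePushURL(lookup func(string) (string, bool)) (url string, enabled bool) {
	v, ok := lookup("ZOEKT_INDEXER_INTERNAL_URL")
	switch {
	case !ok:
		return defaultIndexerInternalURL, true
	case v == "":
		return "", false
	default:
		return v, true
	}
}

// pushURLFromEnv applies the same rules to the real process environment.
func pushURLFromEnv() (string, bool) {
	return resolvePushURL(os.LookupEnv)
}
```

In this sketch the disabled case returns an empty URL, which lines up with StartPushLoop returning immediately when url is empty.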
Heartbeat payload (internal/task_request/task_request.go):
- buildRequestPayload adds process_health, with indexer always present and webserver included only if the last report's ReceivedAt is within OnlineThreshold.
GDK / local dev (Makefile):
- The gdk-webserver-1, gdk-webserver-2, and gdk-test-webserver targets set ZOEKT_INDEXER_INTERNAL_URL to each pair's indexer port (6080, 6081, 6060 respectively) since GDK runs two pairs side by side on the same host.
Docs (README.md):
- "Webserver-to-indexer process health relay" section explaining the env var, the default, the empty-string opt-out, the JWT signing, and the GDK setup.
Tests added
- internal/shared/processhealth/processhealth_test.go: WebserverStore get/set/replace, concurrent readers+writers (race-detector smoke test), CollectMetrics(nil) yields zero restarts, CollectWebserverMetrics always populates ShardsLoaded.
- internal/shared/processhealth/endpoint_test.go: handler stores a valid POST, rejects non-POST with 405, rejects malformed JSON with 400; StartPushLoop returns immediately on empty URL, pushes on a ticker, stops cleanly on context cancel, and sends valid ProcessMetrics JSON. New: verifies that every push carries a Bearer JWT signed with the shared secret when an Auth is provided.
- internal/server/routes_test.go: end-to-end through the chi router. POST without JWT is rejected with 401, POST with a valid JWT is accepted and invokes the handler, non-POST methods get 405, and a sanity check confirms that other routes still require JWT (so a regression that drops the middleware doesn't silently pass).
- internal/task_request/task_request_test.go: a fresh webserver report is included in process_health.webserver; a stale report (older than OnlineThreshold) is omitted.
How to verify locally
gdk reconfigure && make build-unified && gdk restart

Check zoekt_nodes on the Rails console. You should see two nodes with process_health in their metadata.

Search::Zoekt::Node.online
[#<Search::Zoekt::Node:0x000000013923caa0
id: 2,
uuid: "c17e4133-a552-4626-be66-58381f9189ac",
used_bytes: 253531934720,
total_bytes: 994662584320,
last_seen_at: "2026-05-12 13:10:18.765755000 +0000",
created_at: "2025-12-10 13:34:22.175603000 +0000",
updated_at: "2026-05-12 13:10:18.766252000 +0000",
index_base_url: "http://localhost:6081",
search_base_url: [FILTERED],
metadata:
{"name"=>"rkumar--20251208-V9YPK", "version"=>"2026.05.12-v1.15.0-13-g919f37a", "task_count"=>0, "concurrency"=>16, "process_health"=>{"indexer"=>{"mmap_max"=>0, "rss_bytes"=>0, "restarts_1m"=>0, "restarts_5m"=>1, "mmap_current"=>0, "restarts_15m"=>2, "uptime_seconds"=>93}, "webserver"=>{"mmap_max"=>0, "rss_bytes"=>0, "restarts_1m"=>0, "restarts_5m"=>1, "mmap_current"=>0, "restarts_15m"=>2, "shards_loaded"=>14, "uptime_seconds"=>91}}},
indexed_bytes: 51393700,
usable_storage_bytes: 741182043300,
schema_version: 2602,
services: [0],
webserver_last_seen_at: "2026-05-12 13:10:18.765812000 +0000">,
#<Search::Zoekt::Node:0x000000013923c960
id: 1,
uuid: "891fbc94-936f-4154-8ee4-8d0e99119bbc",
used_bytes: 253531934720,
total_bytes: 994662584320,
last_seen_at: "2026-05-12 13:10:18.765815000 +0000",
created_at: "2025-12-10 13:34:21.621143000 +0000",
updated_at: "2026-05-12 13:10:18.766294000 +0000",
index_base_url: "http://localhost:6080",
search_base_url: [FILTERED],
metadata:
{"name"=>"rkumar--20251208-V9YPK", "version"=>"2026.05.12-v1.15.0-13-g919f37a", "task_count"=>0, "concurrency"=>16, "process_health"=>{"indexer"=>{"mmap_max"=>0, "rss_bytes"=>0, "restarts_1m"=>0, "restarts_5m"=>1, "mmap_current"=>0, "restarts_15m"=>2, "uptime_seconds"=>93}, "webserver"=>{"mmap_max"=>0, "rss_bytes"=>0, "restarts_1m"=>0, "restarts_5m"=>1, "mmap_current"=>0, "restarts_15m"=>2, "shards_loaded"=>10, "uptime_seconds"=>91}}},
indexed_bytes: 10668608,
usable_storage_bytes: 741141318208,
schema_version: 2602,
services: [0],
webserver_last_seen_at: "2026-05-12 13:10:18.765873000 +0000">]

Backward compatibility
- Indexer: the new route is purely additive on the existing listener; no port change, no new dependency. It uses the same JWT middleware as every other /indexer/... route.
- Webserver: when ZOEKT_INDEXER_INTERNAL_URL is unset, the default points at the same localhost:6065 the indexer already binds to in production. When set to an empty string, the loop is skipped and behavior matches the pre-MR state. Operators can opt out at any time. The webserver already loads the shared secret for its own auth, so signing pushes adds no new configuration.
- Rails: receives an additional process_health field but is free to ignore it until #593555 lands the schema on the server side.
References
- Spec: gitlab#593552
- Epic: gitlab-org#21348
- GDK companion: gitlab-development-kit!5950 (merged)
- Blocked by: restart tracking !904 (merged)