feat: add webserver-to-indexer process health relay

What does this MR do and why?

Implements the webserver-to-indexer process health relay described in gitlab#593552.

The indexer sends a heartbeat to Rails every ~10s. Until now, Rails had visibility into the indexer's health but not the webserver's. This MR adds a relay so the indexer can include the webserver's health data inside its own heartbeat payload.

Architecture

Webserver ──POST ZOEKT_INDEXER_INTERNAL_URL/indexer/internal/process_health──▶ Indexer ──heartbeat──▶ Rails

The webserver pushes its metrics to the indexer over the loopback interface; the indexer keeps the latest report in memory and folds it into every heartbeat, omitting the webserver block when the last report is older than 30 seconds. The absence itself is the failure signal for Rails.

How it works

New internal/shared/processhealth package:

  • ProcessMetrics — wire payload (mmap_current, mmap_max, restarts_1m/5m/15m, rss_bytes, uptime_seconds, plus shards_loaded on the webserver side via omitempty).
  • WebserverReport: ProcessMetrics plus a ReceivedAt timestamp.
  • WebserverStore — lock-free atomic.Pointer[WebserverReport]; writer is the HTTP handler, reader is the heartbeat builder.
  • CollectMetrics(tracker) — reads process_resident_memory_bytes, process_start_time_seconds, mmap gauges from the Prometheus default gatherer, and restart counts from internal/process_starts. Nil tracker yields zero restart counts.
  • CollectWebserverMetrics(tracker) — same plus zoekt_shards_loaded (always populated, even when zero, so the field is always present for the webserver).
  • Handler(store) returns an http.HandlerFunc that decodes a ProcessMetrics POST body and stores it with the receive timestamp. 405 for non-POST, 400 for malformed bodies. Auth is enforced by the indexer's existing JWT middleware on the main router, not by the handler itself.
  • StartPushLoop(ctx, url, interval, tracker, auth) — webserver push loop; returns immediately when url is empty (graceful opt-out). When auth is non-nil, every push carries a short-lived JWT in Gitlab-Zoekt-Api-Request: Bearer <token> so the indexer accepts it like any other request.
  • Constants: EndpointPath = "/indexer/internal/process_health", DefaultIndexerInternalURL = "http://localhost:6065", OnlineThreshold = 30s, PushInterval = 10s.

Indexer side (internal/mode/indexer/indexer.go, internal/server/):

  • Creates a WebserverStore at startup and wires it into IndexServer.ProcessHealthHandler (new http.HandlerFunc field on IndexServer).
  • The new route POST <PathPrefix>/internal/process_health is registered on the existing indexer router (no separate listener, no new port). The chart's indexer.listen.port: 6065 continues to be the only port the indexer exposes.
  • The route is protected by the same JWT middleware that guards every other /indexer/... route. Both processes already share the secret token, so this keeps the new endpoint consistent with the rest of the auth surface and removes any reliance on network topology for security.
  • task_request now receives StartTracker and WebserverStore so the heartbeat builder can call CollectMetrics(...) and read the latest webserver report.

Webserver side (internal/mode/webserver/webserver.go):

  • After loading the shared secret and constructing its authentication.Auth, the webserver reads ZOEKT_INDEXER_INTERNAL_URL and launches processhealth.StartPushLoop with that Auth instance.
    • Unset → defaults to http://localhost:6065 (matches the Helm chart, zero-config in production).
    • Empty (ZOEKT_INDEXER_INTERNAL_URL=) → skips reporting entirely.
    • URL → pushes to that URL every PushInterval, signing each request with a short-lived JWT.
  • Cancelled cleanly via context on shutdown.
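The three cases above depend on telling an unset variable apart from one set to the empty string, which os.Getenv alone cannot do. A minimal sketch using os.LookupEnv follows; `resolveIndexerURL` and `indexerURLFromEnv` are hypothetical helper names, not the MR's.

```go
package main

import "os"

const defaultIndexerInternalURL = "http://localhost:6065"

// resolveIndexerURL maps the three cases onto (url, enabled):
//   - unset     -> default URL, enabled
//   - set to "" -> reporting disabled
//   - otherwise -> the given URL, enabled
//
// lookup has the same shape as os.LookupEnv so it can be faked in tests.
func resolveIndexerURL(lookup func(string) (string, bool)) (string, bool) {
	value, isSet := lookup("ZOEKT_INDEXER_INTERNAL_URL")
	switch {
	case !isSet:
		return defaultIndexerInternalURL, true
	case value == "":
		return "", false
	default:
		return value, true
	}
}

// indexerURLFromEnv applies the rules to the real process environment.
func indexerURLFromEnv() (string, bool) {
	return resolveIndexerURL(os.LookupEnv)
}
```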

Heartbeat payload (internal/task_request/task_request.go):

  • buildRequestPayload adds process_health with indexer always present and webserver included only if the last report's ReceivedAt is within OnlineThreshold.

GDK / local dev (Makefile):

  • The gdk-webserver-1, gdk-webserver-2, and gdk-test-webserver targets set ZOEKT_INDEXER_INTERNAL_URL to each pair's indexer port (6080, 6081, 6060 respectively) since GDK runs two pairs side by side on the same host.

Docs (README.md):

  • "Webserver-to-indexer process health relay" section explaining the env var, the default, the empty-string opt-out, the JWT signing, and the GDK setup.

Tests added

  • internal/shared/processhealth/processhealth_test.go: WebserverStore get/set/replace, concurrent readers+writers (race-detector smoke test), CollectMetrics(nil) yields zero restarts, CollectWebserverMetrics always populates ShardsLoaded.
  • internal/shared/processhealth/endpoint_test.go — handler stores valid POST, rejects non-POST with 405, rejects malformed JSON with 400; StartPushLoop returns immediately on empty URL, pushes on a ticker, stops cleanly on context cancel, sends valid ProcessMetrics JSON; new: verifies that every push carries a Bearer JWT signed with the shared secret when an Auth is provided.
  • internal/server/routes_test.go — end-to-end through the chi router: POST without JWT is rejected with 401, POST with a valid JWT is accepted and invokes the handler, non-POST methods get 405, and a sanity check that other routes still require JWT (so a regression that drops the middleware doesn't silently pass).
  • internal/task_request/task_request_test.go — fresh webserver report is included in process_health.webserver; a stale report (older than OnlineThreshold) is omitted.

How to verify locally

gdk reconfigure && make build-unified && gdk restart

Inspect the Zoekt nodes in the Rails console. Both nodes should have process_health in their metadata, with an indexer block and a webserver block:

Search::Zoekt::Node.online
[#<Search::Zoekt::Node:0x000000013923caa0
  id: 2,
  uuid: "c17e4133-a552-4626-be66-58381f9189ac",
  used_bytes: 253531934720,
  total_bytes: 994662584320,
  last_seen_at: "2026-05-12 13:10:18.765755000 +0000",
  created_at: "2025-12-10 13:34:22.175603000 +0000",
  updated_at: "2026-05-12 13:10:18.766252000 +0000",
  index_base_url: "http://localhost:6081",
  search_base_url: [FILTERED],
  metadata:
   {"name"=>"rkumar--20251208-V9YPK", "version"=>"2026.05.12-v1.15.0-13-g919f37a", "task_count"=>0, "concurrency"=>16, "process_health"=>{"indexer"=>{"mmap_max"=>0, "rss_bytes"=>0, "restarts_1m"=>0, "restarts_5m"=>1, "mmap_current"=>0, "restarts_15m"=>2, "uptime_seconds"=>93}, "webserver"=>{"mmap_max"=>0, "rss_bytes"=>0, "restarts_1m"=>0, "restarts_5m"=>1, "mmap_current"=>0, "restarts_15m"=>2, "shards_loaded"=>14, "uptime_seconds"=>91}}},
  indexed_bytes: 51393700,
  usable_storage_bytes: 741182043300,
  schema_version: 2602,
  services: [0],
  webserver_last_seen_at: "2026-05-12 13:10:18.765812000 +0000">,
 #<Search::Zoekt::Node:0x000000013923c960
  id: 1,
  uuid: "891fbc94-936f-4154-8ee4-8d0e99119bbc",
  used_bytes: 253531934720,
  total_bytes: 994662584320,
  last_seen_at: "2026-05-12 13:10:18.765815000 +0000",
  created_at: "2025-12-10 13:34:21.621143000 +0000",
  updated_at: "2026-05-12 13:10:18.766294000 +0000",
  index_base_url: "http://localhost:6080",
  search_base_url: [FILTERED],
  metadata:
   {"name"=>"rkumar--20251208-V9YPK", "version"=>"2026.05.12-v1.15.0-13-g919f37a", "task_count"=>0, "concurrency"=>16, "process_health"=>{"indexer"=>{"mmap_max"=>0, "rss_bytes"=>0, "restarts_1m"=>0, "restarts_5m"=>1, "mmap_current"=>0, "restarts_15m"=>2, "uptime_seconds"=>93}, "webserver"=>{"mmap_max"=>0, "rss_bytes"=>0, "restarts_1m"=>0, "restarts_5m"=>1, "mmap_current"=>0, "restarts_15m"=>2, "shards_loaded"=>10, "uptime_seconds"=>91}}},
  indexed_bytes: 10668608,
  usable_storage_bytes: 741141318208,
  schema_version: 2602,
  services: [0],
  webserver_last_seen_at: "2026-05-12 13:10:18.765873000 +0000">]

Backward compatibility

  • Indexer: the new route is purely additive on the existing listener; no port change, no new dependency. It uses the same JWT middleware as every other /indexer/... route.
  • Webserver: when ZOEKT_INDEXER_INTERNAL_URL is unset, the default points at the same localhost:6065 the indexer already binds to in production. When set to empty string, the loop is skipped and behavior matches the pre-MR state. Operators can opt out at any time. The webserver already loads the shared secret for its own auth, so signing pushes adds no new configuration.
  • Rails: receives an additional process_health field but is free to ignore it until #593555 lands the schema on the server side.

Edited by Ravi Kumar
