Zoekt: Node process health reporting (#21348) · Epics · GitLab.org

Zoekt: Node process health reporting

## Motivation

Today, GitLab only knows whether a Zoekt node is "online" (heartbeat received in the last 30s) or "offline" (no heartbeat). There is no visibility into the health of the individual processes (indexer and webserver) running on each node. Specifically:

- **Crashlooping webservers** continue to receive search traffic, causing persistent search failures for users routed to that node.
- **mmap exhaustion** (a known failure mode on Linux) is invisible to GitLab until the process crashes.
- **There's no proactive health signal**: we only detect problems reactively from failed search requests (see the `last_search_failure_at` proposal in https://gitlab.com/gitlab-org/gitlab/-/issues/593206 and the history of the removed circuit breaker).

This epic adds process-level health reporting from both the indexer and webserver, and uses that data to make smarter routing and availability decisions on the Rails side.

## Architecture Overview

```
+--------------+                    +--------------+
|  Webserver   |---- localhost ---->|   Indexer    |
| (port 6070)  |  push health via   | (port 6060)  |
|              |  new internal API  |              |
+--------------+                    +------+-------+
                                           |
                                    heartbeat POST
                                  (extended payload)
                                           |
                                           v
                                    +--------------+
                                    | GitLab Rails |
                                    | (receives,   |
                                    |  stores,     |
                                    |  acts on)    |
                                    +--------------+
```

1. **Webserver -> Indexer (localhost)**: Webserver pushes its health data to the indexer via a new lightweight HTTP endpoint, configured via env variable.
2. **Indexer -> Rails (heartbeat)**: Indexer bundles its own health data + the webserver's health data into the existing heartbeat payload under a new `process_health` key.
3. **Rails**: Stores health data in node metadata, evaluates health, and uses it for routing/availability decisions.
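
The bundling in step 2 can be sketched as follows. This is a minimal Python illustration of the payload shape only, not the indexer's Go code. Only the `process_health` key and the `restarts_15m` field name come from this epic; the `indexer`/`webserver` sub-keys, the other field names, and the base heartbeat fields are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def build_heartbeat_payload(
    base: Dict[str, Any],
    indexer_health: Dict[str, Any],
    webserver_health: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Return a copy of the heartbeat payload with a `process_health` key added.

    `webserver_health` is None when the webserver has not relayed anything yet;
    in that case only the indexer's own data is reported (an assumption).
    """
    payload = dict(base)  # shallow copy: existing heartbeat fields stay untouched
    process_health: Dict[str, Any] = {"indexer": indexer_health}
    if webserver_health is not None:
        process_health["webserver"] = webserver_health
    payload["process_health"] = process_health
    return payload


if __name__ == "__main__":
    # All values below are made up for illustration.
    payload = build_heartbeat_payload(
        base={"node": {"name": "zoekt-1"}},  # hypothetical existing heartbeat field
        indexer_health={"restarts_15m": 0, "mmap_current": 1200, "mmap_max": 65530},
        webserver_health={"restarts_15m": 1, "mmap_current": 40000, "mmap_max": 65530},
    )
    print(json.dumps(payload, indent=2))
```

Because the health data sits under a single new key, a Rails version that does not know about `process_health` can ignore it, which fits the backward-compatibility goal of Phase 2.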

## Metrics Collected

For **both** processes (indexer and webserver):

| Metric | Source | Purpose |
|--------|--------|---------|
| mmap current / max | `/proc` (Linux) | Detect approaching mmap exhaustion (crash vector) |
| Restart counts (1m/5m/15m) | Filesystem marker files | Detect crashlooping |
| RSS memory | Go `ProcessCollector` (`process_resident_memory_bytes`) | Memory usage visibility |
| Uptime | Go `ProcessCollector` (`process_start_time_seconds`) | Quick restart detection, admin visibility |

For the **webserver** additionally:

| Metric | Source | Purpose |
|--------|--------|---------|
| Loaded shard count | `zoekt_shards_loaded` (upstream Prometheus gauge) | Verify the node is actually serving data |

All metrics are already collected by the Go runtime or existing Prometheus gauges; we just need to read them and include them in the `process_health` payload.

## Issues

### Phase 1: Data Collection (gitlab-zoekt-indexer)

- https://gitlab.com/gitlab-org/gitlab/-/issues/593552 - Webserver-to-indexer process health relay
- https://gitlab.com/gitlab-org/gitlab/-/issues/593553 - Restart tracking for indexer and webserver processes
- https://gitlab.com/gitlab-org/gitlab/-/issues/593554 - Collect mmap metrics from both indexer and webserver

### Phase 2: Rails Integration

- https://gitlab.com/gitlab-org/gitlab/-/issues/593555 - Extend heartbeat API to accept process health data
- https://gitlab.com/gitlab-org/gitlab/-/issues/593557 - Update rake tasks (info display only, no health logic yet)

### Phase 3: Health Intelligence

- https://gitlab.com/gitlab-org/gitlab/-/issues/593556 - Node health evaluation and search routing based on process health
- https://gitlab.com/gitlab-org/gitlab/-/issues/593557 - Update rake tasks (health checks)

## Implementation Order

- **Phase 1** is entirely in the indexer repo and can be developed/tested independently.
- **Phase 2** is backward-compatible: Rails accepts but doesn't act on the new data yet.
- **Phase 3** is where the actual availability improvements happen.

## Key Design Decisions

- **Naming**: `process_health` for the heartbeat payload key and internal relay endpoint.
- **Crashloop detection**: `restarts_15m >= 2` marks a node unhealthy. A single restart is forgiven. Recovery is automatic when restarts age out of the 15m window. The threshold is configurable via a Zoekt application setting.
- **mmap thresholds**: Fixed at 80% warning / 95% unhealthy (stop traffic). Not configurable.
- **Webserver staleness**: Uses the same `ONLINE_DURATION_THRESHOLD` (30s) as the indexer `.online` scope.
- **Routing**: Unhealthy nodes are fully excluded from search routing, with an "all nodes unhealthy" fallback to prevent cascade failures.
- **Backward compatibility**: Guarded by `all_at_least_version?` (same pattern as offset pagination in !225523).
- **Relay authentication**: Localhost-only binding, no shared secret needed.
- **`webserver_last_seen_at` column**: Top-level DB column on `zoekt_nodes` (not buried in JSONB) for efficient scopes and queries.

## Previous Work

- https://gitlab.com/gitlab-org/gitlab/-/issues/530296
- https://gitlab.com/gitlab-org/gitlab-zoekt-indexer/-/merge_requests/409 (closed, used as reference only)
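
The thresholds and routing rule listed under Key Design Decisions can be sketched as below. This is an illustrative Python model, not the Rails implementation. The numeric thresholds, `restarts_15m`, and the 30s window come from this epic; the class and function names, the inclusive comparisons, treating a stale webserver as unhealthy, letting either process mark the whole node unhealthy, and falling back to *all* nodes when none is healthy are assumptions.

```python
from dataclasses import dataclass
from typing import List

ONLINE_DURATION_THRESHOLD_S = 30  # same 30s window as the indexer `.online` scope
MMAP_WARNING_PERCENT = 80         # fixed, not configurable
MMAP_UNHEALTHY_PERCENT = 95       # fixed, not configurable
DEFAULT_RESTART_THRESHOLD = 2     # restarts_15m >= 2 is unhealthy; configurable

_SEVERITY = {"healthy": 0, "warning": 1, "unhealthy": 2}


@dataclass
class ProcessHealth:
    restarts_15m: int
    mmap_current: int
    mmap_max: int  # 0 means "unknown" here, so the mmap check is skipped


@dataclass
class Node:
    name: str
    indexer: ProcessHealth
    webserver: ProcessHealth
    webserver_age_s: float  # seconds since `webserver_last_seen_at`


def process_status(h: ProcessHealth,
                   restart_threshold: int = DEFAULT_RESTART_THRESHOLD) -> str:
    """Classify one process from its restart count and mmap usage."""
    if h.restarts_15m >= restart_threshold:
        return "unhealthy"  # crashlooping; a single restart is forgiven
    if h.mmap_max > 0:
        # Integer arithmetic avoids float rounding at the exact boundaries.
        if h.mmap_current * 100 >= MMAP_UNHEALTHY_PERCENT * h.mmap_max:
            return "unhealthy"
        if h.mmap_current * 100 >= MMAP_WARNING_PERCENT * h.mmap_max:
            return "warning"
    return "healthy"


def node_status(node: Node,
                restart_threshold: int = DEFAULT_RESTART_THRESHOLD) -> str:
    """Return the worst status across both processes; stale webserver data is unhealthy."""
    if node.webserver_age_s > ONLINE_DURATION_THRESHOLD_S:
        return "unhealthy"
    statuses = [
        process_status(node.indexer, restart_threshold),
        process_status(node.webserver, restart_threshold),
    ]
    return max(statuses, key=_SEVERITY.__getitem__)


def routable_nodes(nodes: List[Node]) -> List[Node]:
    """Exclude unhealthy nodes, but keep every node if none would remain."""
    usable = [n for n in nodes if node_status(n) != "unhealthy"]
    return usable if usable else list(nodes)
```

In this model, nodes in the warning state still receive traffic; only the unhealthy state removes a node from routing. Recovery needs no extra logic because `restarts_15m` drops on its own as restarts age out of the 15m window.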
