Zoekt: Node health evaluation and search routing based on process health
## Summary

Add logic to evaluate node health based on the new process health data and surface a per-node health status that can be used for search routing decisions. Unhealthy nodes are fully excluded from search routing.

## Health Evaluation Rules

### Crashloop Detection

- `restarts_15m >= N` (configurable threshold, default: 2) for **either** process (indexer or webserver) → node is **unhealthy**
- A single restart is forgiven (transient OOM, deploy, etc.)
- Recovery is automatic: once the restarts age out of the 15m window, the node becomes healthy again
- Add a new Zoekt application setting for the restart threshold (tunable from Rails without redeploying)
- Indexer crashlooping also affects search: stale indices mean outdated results, and the heartbeat (including relayed webserver metrics) becomes unreliable

### mmap Exhaustion

- `mmap_current / mmap_max >= 0.95` (95%) for **either** process → node is **unhealthy**, stop serving search traffic
- `mmap_current / mmap_max >= 0.80` (80%) → warning only (operators should investigate, node stays in rotation)

### Webserver Staleness

- `webserver_last_seen_at` older than `ONLINE_DURATION_THRESHOLD` (30s) → webserver is unresponsive, node is **unhealthy** for search
- Simple scope: `scope :webserver_online, -> { where(webserver_last_seen_at: THRESHOLD.ago..) }`

## Search Routing Changes

### Updated `.online` Scope

Update the existing `.online` scope to require both processes to be reporting when the node supports it:

- Guard with `all_at_least_version?(MIN_PROCESS_HEALTH_VERSION)`, the same pattern used for offset pagination in !225523
- When all nodes support process health: `.online` checks both `last_seen_at` and `webserver_last_seen_at`
- When any node is on an older version: fall back to current behavior (only check `last_seen_at`)
- Since `.searchable` is an alias for `.online`, search routing automatically benefits

### `search_healthy` Scope

Add a `search_healthy` scope that combines `.online` with process health checks (crashloop, mmap) for the load balancer.

### "All Nodes Unhealthy" Fallback

**Important**: When all nodes are marked unhealthy, fall back to all online nodes rather than hard-failing. This avoids the cascade failure pattern that caused the previous circuit breaker removal (!190464). The health exclusion should only activate when a *subset* of nodes is failing.

## Related

- Connects to the `last_search_failure_at` proposal (https://gitlab.com/gitlab-org/gitlab/-/issues/593206): proactive health signals complement reactive failure tracking
- Previous circuit breaker history: !136346 (added), !190464 (removed)
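
To make the health evaluation rules and the "all nodes unhealthy" fallback above concrete, here is a minimal plain-Ruby sketch. It is not the proposed ActiveRecord implementation: `ProcessHealth`, `Node`, `evaluate`, and `search_targets` are hypothetical names used only for illustration. The thresholds mirror the values in this issue. Treating a missing or zero `mmap_max` as "no pressure" and a missing `webserver_last_seen_at` as stale are assumptions.

```ruby
# Hypothetical in-memory stand-ins for the per-process health data.
ProcessHealth = Struct.new(:restarts_15m, :mmap_current, :mmap_max, keyword_init: true)
Node = Struct.new(:name, :indexer, :webserver, :webserver_last_seen_at, keyword_init: true)

RESTART_THRESHOLD = 2            # default N; would come from an application setting
MMAP_UNHEALTHY_RATIO = 0.95
MMAP_WARNING_RATIO = 0.80
ONLINE_DURATION_THRESHOLD = 30   # seconds

# Assumption: a missing or zero mmap_max is treated as "no mmap pressure".
def mmap_ratio(process)
  return 0.0 if process.mmap_max.to_i.zero?

  process.mmap_current.to_f / process.mmap_max
end

# Returns the health verdict plus the reasons and warnings behind it.
def evaluate(node, now: Time.now, restart_threshold: RESTART_THRESHOLD)
  processes = [node.indexer, node.webserver]
  reasons = []

  # Crashloop: either process restarted at least `restart_threshold` times in 15m.
  reasons << :crashloop if processes.any? { |p| p.restarts_15m >= restart_threshold }

  # mmap exhaustion: either process at or above 95% of its limit.
  reasons << :mmap_exhausted if processes.any? { |p| mmap_ratio(p) >= MMAP_UNHEALTHY_RATIO }

  # Webserver staleness: no report within the online threshold.
  last_seen = node.webserver_last_seen_at
  reasons << :webserver_stale if last_seen.nil? || (now - last_seen) > ONLINE_DURATION_THRESHOLD

  # Warning only: at or above 80% but below the unhealthy threshold.
  warnings = []
  if !reasons.include?(:mmap_exhausted) && processes.any? { |p| mmap_ratio(p) >= MMAP_WARNING_RATIO }
    warnings << :mmap_high
  end

  { healthy: reasons.empty?, reasons: reasons, warnings: warnings }
end

# Prefer healthy nodes; if none are healthy, fall back to all online nodes
# instead of returning an empty set.
def search_targets(online_nodes, now: Time.now)
  healthy = online_nodes.select { |node| evaluate(node, now: now)[:healthy] }
  healthy.empty? ? online_nodes : healthy
end
```

Returning the reasons alongside the boolean keeps the warning-only case (80% mmap) visible to operators without removing the node from rotation, and the fallback in `search_targets` means exclusion only takes effect when a subset of nodes is failing.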