Surface enrichment errors in Grafana
Problem
Enrichment failure tracking, implemented in #15572, now stores enrichment error data in ClickHouse. However, this data is not yet visible to operations teams for monitoring or alerting.
Currently, enrichment failures can only be observed through:
- Application logs (which are eventually rotated)
- Direct ClickHouse queries (requiring technical expertise)
This lack of visibility makes it difficult to:
- Monitor enrichment health in real time
- Set up alerts for critical failure thresholds
- Track trends and identify patterns in enrichment issues
- Provide operations teams with actionable dashboards
Proposal
Create Grafana dashboards to surface enrichment error metrics using the ClickHouse failure tracking table implemented in #15572.
Possible Metrics to Surface
- Unresolved Errors Over Time
  - Trend of unique enrichment failures minus successfully enriched events
  - Helps identify if the error backlog is growing
- Failure Rate by Reason
  - Top failure reasons ranked by frequency
  - Enables prioritization of fixes based on impact
- Raw vs Processed Events
  - Gap between total raw events and successfully processed events
  - Shows overall enrichment pipeline health
- Error Volume Trends
  - Daily/hourly error counts to identify spikes or patterns
  - Useful for capacity planning and incident detection
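To pin down what the metrics above would compute, here is a minimal Python sketch over in-memory rows. The field names (`event_id`, `reason`, `failed_at`) and the sample data are hypothetical; the actual schema of the failure tracking table from #15572 may differ. In a dashboard these would more likely be expressed as queries against that table, so treat this only as a statement of the intended definitions.

```python
from collections import Counter
from datetime import datetime

# Hypothetical failure rows; column names are assumptions, not the real schema.
failures = [
    {"event_id": "e1", "reason": "timeout", "failed_at": datetime(2024, 1, 1, 9, 5)},
    {"event_id": "e1", "reason": "timeout", "failed_at": datetime(2024, 1, 1, 9, 40)},
    {"event_id": "e2", "reason": "missing_field", "failed_at": datetime(2024, 1, 1, 9, 50)},
    {"event_id": "e3", "reason": "timeout", "failed_at": datetime(2024, 1, 1, 10, 15)},
]
# Hypothetical set of events that were eventually enriched successfully.
enriched_ids = {"e1"}


def unresolved_ids(failures, enriched_ids):
    """Unique failed events minus events that were later enriched."""
    return {row["event_id"] for row in failures} - set(enriched_ids)


def failures_by_reason(failures):
    """Failure reasons ranked by frequency, most common first."""
    return Counter(row["reason"] for row in failures).most_common()


def hourly_error_counts(failures):
    """Error counts per hour bucket, sorted by time."""
    counts = Counter(
        row["failed_at"].replace(minute=0, second=0, microsecond=0)
        for row in failures
    )
    return sorted(counts.items())


def processing_gap(raw_total, processed_total):
    """Gap between total raw events and successfully processed events."""
    return raw_total - processed_total


unresolved = unresolved_ids(failures, enriched_ids)
by_reason = failures_by_reason(failures)
hourly = hourly_error_counts(failures)
```

Note that `e1` failed twice but counts once toward unique failures, and drops out of the unresolved set because it was later enriched. Whether retries should be deduplicated this way is a design choice the dashboards would need to settle.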
Edited by Tarun Vellishetty