Elasticsearch monitoring: saturation
This issue covers saturation monitoring only, not setting SLOs or alerts. Once we have metrics to base SLOs and alerts on, we can set separate SLOs for the logging and search deployments of Elasticsearch.
Some resources that can be saturated by Elasticsearch:
- CPU
- Memory (treat with caution because of Java's memory allocation behaviour: is there much we can infer from this metric?)
- Worst-case CPU by node (single_node_cpu)
- Disk space
- JVM heap
- Thread pools (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html)
Is anything missing from that list?
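To make the list above concrete, here is a minimal sketch of how per-node saturation ratios, and the worst case across nodes, could be derived from a parsed `GET _nodes/stats` response. The function names and the sample data are hypothetical and for illustration only. Thread pools are left out because queue capacity is not part of the stats response.

```python
from typing import Dict


def node_saturation(node: dict) -> Dict[str, float]:
    """Return saturation ratios (0.0 to 1.0) for one node's stats entry."""
    heap = node["jvm"]["mem"]
    disk = node["fs"]["total"]
    return {
        "cpu": node["os"]["cpu"]["percent"] / 100.0,
        "jvm_heap": heap["heap_used_in_bytes"] / heap["heap_max_in_bytes"],
        "disk_space": 1.0 - disk["available_in_bytes"] / disk["total_in_bytes"],
    }


def worst_case(stats: dict) -> Dict[str, float]:
    """Take the maximum of each ratio across all nodes in the response."""
    worst: Dict[str, float] = {}
    for node in stats["nodes"].values():
        for resource, ratio in node_saturation(node).items():
            worst[resource] = max(worst.get(resource, 0.0), ratio)
    return worst


# Made-up sample shaped like a node stats response (not real data).
SAMPLE_STATS = {
    "nodes": {
        "node-a": {
            "os": {"cpu": {"percent": 40}},
            "jvm": {"mem": {"heap_used_in_bytes": 2, "heap_max_in_bytes": 8}},
            "fs": {"total": {"available_in_bytes": 50, "total_in_bytes": 100}},
        },
        "node-b": {
            "os": {"cpu": {"percent": 90}},
            "jvm": {"mem": {"heap_used_in_bytes": 6, "heap_max_in_bytes": 8}},
            "fs": {"total": {"available_in_bytes": 75, "total_in_bytes": 100}},
        },
    }
}

if __name__ == "__main__":
    print(worst_case(SAMPLE_STATS))
```

Taking the maximum per resource corresponds to the "worst-case by node" idea: one hot node can be saturated while the cluster average looks healthy.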
We should generate key metrics for an Elasticsearch service, as we already do for the logging service. See gitlab-com/runbooks!1964 (merged) for recent changes to the logging key metrics. Some relevant metrics may already be collected and integrated into the key metrics for the logging service.
Prometheus is our monitoring platform: prefer Prometheus metrics and alerts to Elasticsearch watchers wherever feasible. This may require extending the Elasticsearch exporter.
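As a rough illustration of what an exporter extension might expose, the sketch below renders a thread pool queue saturation gauge in the Prometheus text exposition format. It is not the existing exporter's code. The metric name, label names, and function name are hypothetical. It assumes queue depth comes from node stats (`thread_pool.<pool>.queue`) and capacity from node info (`thread_pool.<pool>.queue_size`), and that label values need no escaping.

```python
from typing import Dict

# Hypothetical metric name, chosen for illustration only.
METRIC = "elasticsearch_thread_pool_queue_saturation_ratio"


def render_queue_saturation(
    node: str, queued: Dict[str, int], capacity: Dict[str, int]
) -> str:
    """Render queue depth / queue capacity per pool as exposition text."""
    lines = [
        f"# HELP {METRIC} Queued tasks divided by queue capacity.",
        f"# TYPE {METRIC} gauge",
    ]
    for pool in sorted(queued):
        size = capacity.get(pool, -1)
        if size <= 0:
            # Unbounded (-1) or unknown capacity: no meaningful ratio.
            continue
        ratio = queued[pool] / size
        lines.append(f'{METRIC}{{node="{node}",pool="{pool}"}} {ratio}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up values, not real measurements.
    print(
        render_queue_saturation(
            "node-a",
            queued={"write": 50, "search": 250, "generic": 3},
            capacity={"write": 200, "search": 1000, "generic": -1},
        ),
        end="",
    )
```

Pools with an unbounded queue are skipped rather than reported as zero, so that a missing series is not mistaken for an idle pool.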