Thanos ruler is dropping a lot of metrics
Problem Summary
We rely on some calculated metrics that come from thanos-ruler
in order to provide tooling that calculates burn rates for various fleets and shards. This visibility is utilized for various components to determine the health of the service over a period of time. This is rolled up into reports, publicly visible dashboards, and queried constantly by Deployer/Chatops to ensure safety during some operations (deployments/feature flag changes). When metrics go missing, this is problematic in various ways:
- Reports are now lacking some of the information required to put together accurate information
- When querying real-time, a given service will appear to be gone entirely, and thus is not reported - delivery#2255 (closed)
Initial Investigation
While investigating issue: delivery#2255 (closed)
We've discovered that Thanos Ruler is failing to evaluate a lot of metrics.
- Logs: https://nonprod-log.gitlab.net/goto/6f5517c0-9e2c-11ec-b3a6-472d0398dd6e
- Host level Metrics:
This is problematic as recording rules that rely on these metrics have large gaps.
Observations
- The CPU is a tad stressed on these machines, they have 2 CPU's are are occasionally spiking to near 100% CPU use, but appear to bounce heavily between 50-70% usage.
- These nodes are also running Ubuntu 16.
- Thanos-Ruler is querying Promethues Pods, for which are timing out when performing various queries
Utilize this issue to determine what mitigations we need to put into place to enable Thanos ruler to properly record the rules it is responsible for.