Thanos ruler is dropping a lot of metrics

Problem Summary

We rely on calculated metrics from thanos-ruler to power tooling that calculates burn rates for various fleets and shards. Various components use this visibility to determine the health of a service over a period of time. The data is rolled up into reports and publicly visible dashboards, and is queried constantly by Deployer/Chatops to ensure safety during some operations (deployments/feature flag changes). When metrics go missing, this causes several problems:

- Reports now lack some of the data they need to be accurate
- When queried in real time, a given service appears to be gone entirely and is therefore not reported - delivery#2255 (closed)

Initial Investigation

While investigating delivery#2255 (closed), we discovered that Thanos Ruler is failing to evaluate a lot of metrics.

Logs: https://nonprod-log.gitlab.net/goto/6f5517c0-9e2c-11ec-b3a6-472d0398dd6e
Host level Metrics:
- thanos-rule-01: https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-env=ops&var-node=thanos-rule-01-inf-ops.c.gitlab-ops.internal
- thanos-rule-02: https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-env=ops&var-node=thanos-rule-02-inf-ops.c.gitlab-ops.internal

This is problematic as recording rules that rely on these metrics have large gaps.
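To quantify these gaps, a small helper along the following lines could be used. This is only a sketch: the function name `find_gaps`, the 60-second evaluation interval, and the sample timestamps are all hypothetical and not taken from our tooling. It flags any pair of consecutive samples that are further apart than a tolerance multiple of the expected interval.

```python
from typing import List, Tuple


def find_gaps(
    timestamps: List[float], interval: float, tolerance: float = 1.5
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where consecutive samples are further
    apart than `tolerance * interval` seconds."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > tolerance * interval:
            gaps.append((prev, curr))
    return gaps


if __name__ == "__main__":
    # Hypothetical sample timestamps (seconds) for a rule that is
    # assumed to be evaluated every 60 seconds.
    samples = [0, 60, 120, 420, 480]
    print(find_gaps(samples, interval=60))
```

With the hypothetical samples above, the only gap reported is between 120 and 420 seconds, where four evaluations are missing.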

Observations

The CPU is somewhat stressed on these machines: they have 2 CPUs and occasionally spike to near 100% CPU use, but appear to bounce heavily between 50-70% usage.
- These nodes are also running Ubuntu 16.
Thanos-Ruler is querying Prometheus Pods, which are timing out when performing various queries

Use this issue to determine which mitigations we need to put in place so that Thanos Ruler can properly record the rules it is responsible for.

Edited Mar 08, 2022 by John Skarbek