
Dimension lookup during reporting and issue management

In https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/7299, we noticed that Tamland was reaching out to Prometheus during its reporting and issue management jobs to gather information about relevant dimensions for saturation points ("dynamic dimensional expansion").

In the Dedicated setup specifically, these jobs run in CI and, by design, have no connectivity to Prometheus.

Design

In this issue, let's discuss how Tamland can support this use case and how we can change the internal design so that these lookups are no longer necessary outside the forecasting job.

The underlying problem is that the manifest used to be static but has gained dynamic components, so Prometheus now has to be available to perform label lookups before the full manifest structure is known. While this makes sense for the forecasting job, downstream reporting jobs should be able to work with a fixed manifest instead of consulting Prometheus again (which, at a later point in time, may even yield different results than the forecasting job saw).
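To make that split concrete, here is a minimal sketch in Python. All names below (the `saturation_points` / `dynamic_dimensions` manifest keys, `prometheus.label_values`, the function names) are hypothetical and not Tamland's actual API; the point is only that the forecasting job resolves dynamic dimensions once and writes out a frozen manifest, which the reporting and issue management jobs then read without ever talking to Prometheus:

```python
import json
from pathlib import Path


def resolve_manifest(manifest: dict, prometheus) -> dict:
    """Expand dynamic dimensions once, during the forecasting job."""
    resolved_points = {}
    for name, point in manifest.get("saturation_points", {}).items():
        dims = point.get("dimensions", [])
        if point.get("dynamic_dimensions"):
            # Hypothetical lookup: fetch the current label values from Prometheus.
            dims = prometheus.label_values(point["metric"], point["label"])
        resolved_points[name] = {**point, "dimensions": dims, "dynamic_dimensions": False}
    return {**manifest, "saturation_points": resolved_points}


def write_frozen_manifest(resolved: dict, path: Path) -> None:
    # Persisted by the forecasting job, e.g. as a CI artifact.
    path.write_text(json.dumps(resolved, indent=2))


def load_frozen_manifest(path: Path) -> dict:
    # Reporting and issue management jobs read this instead of querying Prometheus.
    return json.loads(path.read_text())
```

Passing the frozen manifest along as a CI artifact would also make the downstream runs reproducible: they would operate on exactly the dimensions the forecast used.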

Detection

Additionally, we only noticed that these jobs were failing a few weeks in (and likely only by chance and Bob's good 👀, right @reprazent ?). We should find a better way to alert on these conditions, so that we notice failing Tamland jobs right away.

For the capacity planning trackers, we use Slack notifications for failing CI jobs, but we didn't have such an integration in place for Dedicated. Those notifications are also brittle, as we've seen recently: when we rename Slack channels, notifications effectively go to /dev/null.
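One possible direction (purely illustrative, not the integration we have today): a small failure hook run from the CI job itself, posting to a Slack incoming webhook. Since an incoming webhook targets a channel rather than a channel name, it should be less sensitive to renames. The `SLACK_WEBHOOK_URL` variable below is an assumption; `CI_JOB_NAME` and `CI_PIPELINE_URL` are predefined GitLab CI variables.

```python
#!/usr/bin/env python3
"""Illustrative sketch: ping Slack from CI when a Tamland job fails."""
import json
import os
import urllib.request


def notify_failure(job_name: str, pipeline_url: str) -> None:
    payload = {"text": f":rotating_light: Tamland job `{job_name}` failed: {pipeline_url}"}
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # assumed CI variable, not an existing one
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)


if __name__ == "__main__":
    notify_failure(os.environ.get("CI_JOB_NAME", "unknown"), os.environ.get("CI_PIPELINE_URL", ""))
```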

Desired outcomes

  1. A conclusion on how to change Tamland's design to support this use case
  2. An action item for detecting this kind of failure automatically

cc @hmerscher @reprazent