# Expand labels recorded for saturation points
As discussed in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2919, and as a preparation step for tamland#74 (closed), this issue proposes expanding the labels ("dimensions") on the saturation-related recording rules.
Note on timing: This entirely relies on recording rules, so we can only make a forward-looking change; we cannot (easily) get historical data in more detail. This means that once we implement the change in this issue (expanding the recording rules), it will take a significant amount of time (at least 1 month) for enough data to materialize to allow for meaningful dimensional forecasts.
## Example
Let's start off with the example discussed in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2919#vertex-ai-limit.
We currently define a saturation point `gcp_quota_limit_vertex_ai`, which essentially produces a recording rule to capture the saturation ratio over time. A shortened version looks like this:
```yaml
- record: gitlab_component_saturation:ratio
  labels:
    component: gcp_quota_limit_vertex_ai
  expr: |
    max by(env, environment, tier, type, stage, shard) (
      clamp_min(
        clamp_max(
          ....
        ,
        1)
      ,
      0)
    )
```
For a given `env=gprd, type=ai-gateway, component=gcp_quota_limit_vertex_ai`, this produces exactly one data point, without any additional dimensions/labels available from the recording rule.
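To make the effect of the aggregation concrete, here is a minimal Python sketch of how `max by(...)` collapses series. The samples, label values, and the `max_by` helper are made up for illustration; they only mimic the shape of the real metrics:

```python
from collections import defaultdict

# Synthetic samples mimicking the raw series behind the recording rule;
# label values are illustrative, not real production data.
samples = [
    ({"env": "gprd", "type": "ai-gateway", "base_model": "text-bison", "region": "us-east1"}, 0.42),
    ({"env": "gprd", "type": "ai-gateway", "base_model": "code-gecko", "region": "us-east1"}, 0.91),
    ({"env": "gprd", "type": "ai-gateway", "base_model": "code-bison", "region": "us-east1"}, 0.15),
]

def max_by(samples, labels):
    """Rough analogue of PromQL's `max by(<labels>) (...)` for one scrape."""
    groups = defaultdict(list)
    for lbls, value in samples:
        key = tuple(lbls.get(l, "") for l in labels)
        groups[key].append(value)
    return {key: max(values) for key, values in groups.items()}

# Current rule: base_model/region are dropped -> a single series remains.
coarse = max_by(samples, ["env", "type"])
# Expanded rule: keep base_model/region -> one series per combination.
fine = max_by(samples, ["env", "type", "base_model", "region"])
print(len(coarse), len(fine))
```

The coarse aggregation yields one series carrying only the overall maximum, while keeping `base_model` and `region` preserves one series per combination.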
On the other hand, we're interested in detailing the saturation information across a few specific labels. In this case, we already use `resourceLabels: ['base_model', 'region']` to pre-aggregate saturation metrics. This is also used in saturation dashboards for `ai-gateway`; for example, we can see in this dashboard that saturation is split up by base model (a few) and region (only one). In this dashboard, we actually don't use the recorded saturation metric (which doesn't have this level of detail) but essentially the same query as the recording rule, with additional labels (note the additional `base_model, region` aggregation labels at the top):
```promql
max by(env, environment, tier, type, stage, shard, base_model, region) (
  clamp_min(
    clamp_max(
      ....
    ,
    1)
  ,
  0)
)
```
This information isn't readily available to Tamland, for example, because the saturation recording rule only produces a high-level aggregate rather than this level of detail.
## Why do we need more detail?
This is in essence what tamland#74 (closed) is about:
In the example above, `gcp_quota_limit_vertex_ai` defines a saturation point, also called a component, which captures how a particular bottleneck for a service can be modeled.
In our case, the specific service (`ai-gateway`) has not just one instance of said saturation point but many! We are specifically interested in modeling saturation points differently for each `base_model` (e.g. `text-bison`, `code-gecko`, `code-bison`, etc.) and also across different regions. Ultimately, we want to be able to reason about the individual saturation of `(text-bison, us-east1)`, `(code-gecko, us-east2)`, and so on, because a saturation event in any of those combinations poses a problem.
## Current state / challenges
So far, we've simplified and simply taken the max-aggregate across all those combinations over time. This results in two challenges:

1. It is difficult to forecast, because the underlying nature of the time series changes. Which component has the leading saturation: `(text-bison, us-east1)` or `(code-gecko, us-east2)`? Characteristics of both go into the max-aggregate, which is what gets used for forecasting.
2. We cannot present saturation forecasts in more detail: we can't display saturation forecasts for `text-bison` separately from `code-gecko`, as there is only a single max-aggregate forecast (which may be completely bogus, see (1)).
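The forecasting problem can be sketched in a few lines of Python. The two series below are entirely synthetic (the growth rates and base values are invented), but they show how the component feeding the max-aggregate can switch partway through, so a single forecast fitted to the aggregate mixes two unrelated growth patterns:

```python
# Two hypothetical components with different (synthetic) growth patterns:
# one leads early from a high base, the other overtakes with faster growth.
steps = range(10)
text_bison = [0.60 + 0.01 * t for t in steps]  # e.g. (text-bison, us-east1)
code_gecko = [0.28 + 0.06 * t for t in steps]  # e.g. (code-gecko, us-east2)

# What today's recording rule captures: only the pointwise maximum.
max_aggregate = [max(a, b) for a, b in zip(text_bison, code_gecko)]

# The series feeding the max-aggregate switches partway through, so a
# forecast fitted to the aggregate blends two unrelated growth rates.
crossover = next(t for t in steps if code_gecko[t] > text_bison[t])
print(crossover)
```

Before the crossover the aggregate tracks one component's slow growth; after it, the other's fast growth. Neither trend describes the aggregate as a whole.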
## Proposal (status: draft / up for discussion)
In order to provide saturation forecasts in more detail, we need to provide more detailed data to Tamland (this issue). Additionally, Tamland will need to cope with a significantly higher number of forecasts to run, and also understand that these additional labels need to be treated similarly to "shards" (not this issue, but tamland#74 (closed)).
It's relatively straightforward to expand the recording rule. gitlab-com/runbooks!7048 (diffs) basically takes `resourceLabels` and treats them as `maxAggregateLabels`. This immediately results in the following recording rule for our example:
```yaml
- record: gitlab_component_saturation:ratio
  labels:
    component: gcp_quota_limit_vertex_ai
  expr: |
    max by(env, environment, tier, type, stage, shard, base_model, region) (
      clamp_min(
        clamp_max(
          (
            sum without (method) (stackdriver_aiplatform_googleapis_com_location_aiplatform_googleapis_com_quota_online_prediction_requests_per_base_model_usage{env="gprd",type="ai-gateway"})
            /
            stackdriver_aiplatform_googleapis_com_location_aiplatform_googleapis_com_quota_online_prediction_requests_per_base_model_limit{env="gprd",type="ai-gateway"}
          ) > 0
        ,
        1)
      ,
      0)
    )
```
In the use cases I'm aware of for `gitlab_component_saturation:ratio`, we do in fact apply additional (max) aggregates to derive actual saturation. For example, in Tamland we use `max(quantile_over_time(0.95, gitlab_component_saturation:ratio{%s}[1h]))` (with `type, component, env` filters for `%s`) to get saturation for a component. The same applies to dashboards, like in this example. This means the change above won't affect Tamland or those dashboards, because the resulting max-aggregate is unchanged when the underlying data carries additional dimensions from pre-aggregation.
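The claim that consumer-side max-aggregates are unchanged rests on max being associative over partitions: taking a max per `(base_model, region)` first and then a max over those series gives the same result as one max over everything. A small Python sketch with made-up values (the instance label and ratios below are hypothetical) checks this:

```python
# Hypothetical raw saturation ratios keyed by (base_model, region, instance);
# all values are invented for illustration.
raw = {
    ("text-bison", "us-east1", "pod-a"): 0.41,
    ("text-bison", "us-east1", "pod-b"): 0.58,
    ("code-gecko", "us-east1", "pod-a"): 0.93,
    ("code-bison", "us-east4", "pod-a"): 0.12,
}

# Expanded recording rule: pre-aggregate a max per (base_model, region).
fine = {}
for (model, region, _instance), value in raw.items():
    key = (model, region)
    fine[key] = max(fine.get(key, 0.0), value)

# Consumers (Tamland, dashboards) take a max over whatever series exist,
# so the final aggregate is identical with or without pre-aggregation.
assert max(fine.values()) == max(raw.values())
print(max(fine.values()))
```

This is why the change is transparent to existing max-based consumers; it would not hold for, say, an average-based consumer, which is exactly the semantic-safety question raised under "Downsides/Risks" below.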
## Incremental rollout
We can limit this change to select saturation points. To start with, we could use this mechanic for https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2919 only and, instead of duplicating saturation points, go down the route of this draft change but with a conditional applying it to the relevant saturation points only.
Other saturation points would remain unaffected in terms of their recording rules.
## Downsides/Risks
- Do we always use a max-aggregate when using saturation data? I.e. is this change safe semantically?
- We significantly increase the cardinality of those saturation recording rules.
- ?
## Alternatives
As we discuss in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2919, we could alternatively add additional saturation points, one targeted at each of these dimension combinations.
In the example above, we'd have to add a whole bunch of (nearly identical) saturation points:

- `gcp_quota_limit_vertex_ai_text-bison-us-east1`
- `gcp_quota_limit_vertex_ai_code-gecko-us-east1`
- etc.
I believe this is a conceptual mismatch. It would be hard to maintain, because we'd have to add more and more saturation points as we adopt more models or regions. With the proposal above, by contrast, the models and regions actually in use are included automatically, without manual intervention.