# Expand labels recorded for saturation points
As discussed in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2919, and as a preparation step for tamland#74 (closed), this issue proposes expanding the labels ("dimensions") on the saturation-related recording rules.
Note on timing: This entirely relies on recording rules, so we can only make a forward-looking change; we cannot (easily) get historical data in more detail. This means that once we implement the change in this issue (expanding the recording rules), it will take a significant amount of time (at least 1 month) for enough data to materialize to allow for meaningful dimensional forecasts.
## Example
Let's start off with the example discussed in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2919#vertex-ai-limit.
We currently define a saturation point `gcp_quota_limit_vertex_ai`, which essentially produces a recording rule to capture the saturation ratio over time. A shortened version looks like this:
```yaml
- record: gitlab_component_saturation:ratio
  labels:
    component: gcp_quota_limit_vertex_ai
  expr: |
    max by(env, environment, tier, type, stage, shard) (
      clamp_min(
        clamp_max(
          ....
        ,
        1)
      ,
      0)
    )
```
For a given `env=gprd, type=ai-gateway, component=gcp_quota_limit_vertex_ai`, this produces exactly one data point, without any additional dimensions/labels available from the recording rule.
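To make the effect of the aggregation concrete, here is a minimal Python sketch of how `max by(...)` collapses series. The samples, label values, and the `max_by` helper are made up for illustration; they only mimic the shape of the real metrics:

```python
from collections import defaultdict

# Synthetic samples mimicking the raw series behind the recording rule;
# label values are illustrative, not real production data.
samples = [
    ({"env": "gprd", "type": "ai-gateway", "base_model": "text-bison", "region": "us-east1"}, 0.42),
    ({"env": "gprd", "type": "ai-gateway", "base_model": "code-gecko", "region": "us-east1"}, 0.91),
    ({"env": "gprd", "type": "ai-gateway", "base_model": "code-bison", "region": "us-east1"}, 0.15),
]

def max_by(samples, labels):
    """Rough analogue of PromQL's `max by(<labels>) (...)` for one scrape."""
    groups = defaultdict(list)
    for lbls, value in samples:
        key = tuple(lbls.get(l, "") for l in labels)
        groups[key].append(value)
    return {key: max(values) for key, values in groups.items()}

# Current rule: base_model/region are dropped -> a single series remains.
coarse = max_by(samples, ["env", "type"])
# Expanded rule: keep base_model/region -> one series per combination.
fine = max_by(samples, ["env", "type", "base_model", "region"])
print(len(coarse), len(fine))
```

The coarse aggregation yields one series carrying only the overall maximum, while keeping `base_model` and `region` preserves one series per combination.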
On the other hand, we're interested in detailing the saturation information across a few specific labels. In this case, we already use `resourceLabels: ['base_model', 'region']` to pre-aggregate saturation metrics. This is also used in saturation dashboards for `ai-gateway`; for example, we can see in this dashboard that saturation is split up by base model (a few) and region (only one). In this dashboard, we actually don't use the recorded saturation metric (which doesn't have this level of detail) but essentially the same query as the recording rule, with additional labels (note the additional `base_model, region` aggregation labels at the top):
```promql
max by(env, environment, tier, type, stage, shard, base_model, region) (
  clamp_min(
    clamp_max(
      ....
    ,
    1)
  ,
  0)
)
```
This information isn't readily available to Tamland, for example, because the saturation recording rule only produces a high-level aggregate rather than this level of detail.
## Why do we need more detail?
This is in essence what tamland#74 (closed) is about:
In the example above, `gcp_quota_limit_vertex_ai` defines a saturation point, also called a component, which captures how a particular bottleneck for a service can be modeled.
In our case, the specific service (`ai-gateway`) has not just one instance of said saturation point but many! We are specifically interested in modeling saturation points differently for each `base_model` (e.g. `text-bison`, `code-gecko`, `code-bison`, etc.) and also across different regions. Ultimately, we want to be able to reason about the individual saturation of `(text-bison, us-east1)`, `(code-gecko, us-east2)`, and so on, because a saturation event in any of those combinations poses a problem.
## Current state / challenges
So far, we've simplified and simply taken the max-aggregate across all those combinations over time. This results in two challenges:

1. It is difficult to forecast, because the underlying nature of the time series changes. Which component has the leading saturation: `(text-bison, us-east1)` or `(code-gecko, us-east2)`? Characteristics of both go into the max-aggregate, which is what gets used for forecasting.
2. We cannot present saturation forecasts in more detail: we can't display saturation forecasts for `text-bison` separately from `code-gecko`, as there is only a single max-aggregate forecast (which may be completely bogus, see (1)).
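The forecasting problem can be sketched in a few lines of Python. The two series below are entirely synthetic (the growth rates and base values are invented), but they show how the component feeding the max-aggregate can switch partway through, so a single forecast fitted to the aggregate mixes two unrelated growth patterns:

```python
# Two hypothetical components with different (synthetic) growth patterns:
# one leads early from a high base, the other overtakes with faster growth.
steps = range(10)
text_bison = [0.60 + 0.01 * t for t in steps]  # e.g. (text-bison, us-east1)
code_gecko = [0.28 + 0.06 * t for t in steps]  # e.g. (code-gecko, us-east2)

# What today's recording rule captures: only the pointwise maximum.
max_aggregate = [max(a, b) for a, b in zip(text_bison, code_gecko)]

# The series feeding the max-aggregate switches partway through, so a
# forecast fitted to the aggregate blends two unrelated growth rates.
crossover = next(t for t in steps if code_gecko[t] > text_bison[t])
print(crossover)
```

Before the crossover the aggregate tracks one component's slow growth; after it, the other's fast growth. Neither trend describes the aggregate as a whole.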
## Proposal (status: draft / up for discussion)
In order to provide saturation forecasts in more detail, we need to provide more detailed data to Tamland (this issue). Additionally, Tamland will need to cope with a significantly higher number of forecasts to run, and also understand that these additional labels need to be treated similarly to "shards" (not this issue, but tamland#74 (closed)).
It's relatively straightforward to expand the recording rule. gitlab-com/runbooks!7048 (diffs) basically takes `resourceLabels` and treats them as `maxAggregateLabels`. This immediately results in the following recording rule for our example:
```yaml
- record: gitlab_component_saturation:ratio
  labels:
    component: gcp_quota_limit_vertex_ai
  expr: |
    max by(env, environment, tier, type, stage, shard, base_model, region) (
      clamp_min(
        clamp_max(
          (
            sum without (method) (stackdriver_aiplatform_googleapis_com_location_aiplatform_googleapis_com_quota_online_prediction_requests_per_base_model_usage{env="gprd",type="ai-gateway"})
            /
            stackdriver_aiplatform_googleapis_com_location_aiplatform_googleapis_com_quota_online_prediction_requests_per_base_model_limit{env="gprd",type="ai-gateway"}
          ) > 0
        ,
        1)
      ,
      0)
    )
```
In the use cases I'm aware of for `gitlab_component_saturation:ratio`, we do in fact apply additional (max) aggregates to derive actual saturation. For example, in Tamland we use `max(quantile_over_time(0.95, gitlab_component_saturation:ratio{%s}[1h]))` (with `type, component, env` filters for `%s`) to get saturation for a component. The same applies to dashboards, like in this example. This means the change above won't affect Tamland or those dashboards, because the resulting max-aggregate is unchanged when the underlying data carries additional dimensions from pre-aggregation.
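The claim that consumer-side max-aggregates are unchanged rests on max being associative over partitions: taking a max per `(base_model, region)` first and then a max over those series gives the same result as one max over everything. A small Python sketch with made-up values (the instance label and ratios below are hypothetical) checks this:

```python
# Hypothetical raw saturation ratios keyed by (base_model, region, instance);
# all values are invented for illustration.
raw = {
    ("text-bison", "us-east1", "pod-a"): 0.41,
    ("text-bison", "us-east1", "pod-b"): 0.58,
    ("code-gecko", "us-east1", "pod-a"): 0.93,
    ("code-bison", "us-east4", "pod-a"): 0.12,
}

# Expanded recording rule: pre-aggregate a max per (base_model, region).
fine = {}
for (model, region, _instance), value in raw.items():
    key = (model, region)
    fine[key] = max(fine.get(key, 0.0), value)

# Consumers (Tamland, dashboards) take a max over whatever series exist,
# so the final aggregate is identical with or without pre-aggregation.
assert max(fine.values()) == max(raw.values())
print(max(fine.values()))
```

This is why the change is transparent to existing max-based consumers; it would not hold for, say, an average-based consumer, which is exactly the semantic-safety question raised under "Downsides/Risks" below.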
## Incremental rollout
We can limit this change to select saturation points. To start with, we could use this mechanic for https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2919 only and, instead of duplicating saturation points, go down the route of this draft change but with a conditional applying it to the relevant saturation points only.
Other saturation points would remain unaffected in terms of their recording rules.
## Downsides/Risks
- Do we always use a max-aggregate when using saturation data? I.e. is this change safe semantically?
- We significantly increase the cardinality of those saturation recording rules.
- ?
## Alternatives
As we discuss in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2919, we could alternatively add additional saturation points, one targeted at each of these dimension combinations.
In the example above, we'd have to add a whole bunch of (nearly identical) saturation points:

- `gcp_quota_limit_vertex_ai_text-bison-us-east1`
- `gcp_quota_limit_vertex_ai_code-gecko-us-east1`
- etc.
I believe this is a conceptual mismatch. It would be hard to maintain, because we'd have to add more and more saturation points as we adopt more models or regions. With the proposal above, by contrast, the models and regions actually in use are included automatically, without manual intervention.