Discussion: Observability Service topology for metrics in cells

We had a chat in https://docs.google.com/document/d/1Zgav_4jwyRst0qguAloeX-RFvioq0fE-V9HVeqeL-9g/edit?usp=sharing about the different approaches we could implement for Cells, and post-Cells.

Some factoids we kept in mind:

  1. Legacy GitLab.com is going to stay around for a while, we need to continue to support it.
  2. We could get Legacy GitLab to have a similar setup for metrics as a tenant in the new world, but this is not a requirement.
  3. There are tenants that need complete isolation; GitLab.com cells don't need this.
  4. We do eventually want to be able to have a Single Pane of Glass (SPOG) that can show metrics across cells, but this does not need to have access to all metrics across all cells.
  5. Cell-local means deployed as part of the cell; in our case, this is most likely part of the tenant-observability-stack and associated configuration.

Several topologies were discussed:

1. Cell local Prometheus+Regional Mimir Collection+Global Mimir Aggregation

This setup relies on a cell-local Prometheus that evaluates all recording and alerting rules and uses our global Alertmanager to trigger alerts. The retention in Prometheus can remain limited to a couple of days, at minimum what is needed for alerting.

The cell-local Prometheus can use remote-write to write all metrics to a regionally deployed Mimir tenant for the cell, which is deployed outside of the Cell's stack. This is in contrast to proposal 2, which only writes a limited set of metrics to Mimir.
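
As a rough sketch, the cell-local Prometheus configuration for this could look like the following. The URL, tenant ID, and tuning values are placeholders; it assumes the regional Mimir is multi-tenant and identifies tenants via the `X-Scope-OrgID` header:

```yaml
# Hypothetical cell-local Prometheus config for proposal 1:
# forward *all* scraped series to the regional Mimir tenant for this cell.
remote_write:
  - url: https://mimir.region-1.example.internal/api/v1/push  # placeholder URL
    headers:
      X-Scope-OrgID: cell-01  # placeholder tenant ID for this cell
    queue_config:
      max_samples_per_send: 2000  # illustrative tuning only
```

Recording and alerting rules would still live in this Prometheus's `rule_files`, so rule evaluation and alerting stay inside the cell.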

In this setup, we would be using a global Mimir entrypoint to fan out queries to the regionally deployed Mimir when performing ad-hoc queries, for example during incident response. This allows us to keep the data retained in Prometheus to the bare minimum for rule evaluation, but use the more scalable Mimir setup for querying ranges longer than that.

Similar to proposal 2, global rule aggregations happen in a separate Mimir tenant that can query all the required cell tenants.
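
If we use Mimir's ruler tenant federation for this (an assumption; it requires tenant federation to be enabled on the ruler), a federated rule group in the aggregation tenant might look roughly like this, with all tenant, rule, and metric names as placeholders:

```yaml
# Hypothetical federated rule group evaluated in a dedicated aggregation tenant.
# source_tenants lists the cell tenants this group is allowed to read from.
groups:
  - name: global_slo_aggregations
    source_tenants:
      - cell-01
      - cell-02
    rules:
      - record: slo:gitlab_http_requests:rate5m:global
        expr: sum(rate(gitlab_http_requests_total[5m]))
```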

Advantages

  1. The most flexible solution in terms of retention and scalability: at each level we can have a different data retention depending on the need.
  2. Querying raw metrics in Mimir is easier to scale than in a cell-local Prometheus.
  3. Alerting does not rely on global infrastructure. Querying could also be done on the cell-local Prometheus if needed.

Disadvantages

  1. Duplication of some raw metrics in the cell-local Prometheus and Mimir.

2. Cell local Prometheus+Global Mimir

This setup is the closest to what the tenant-observability-stack currently does.

There is a single cell-local Prometheus that evaluates all recording and alerting rules. This cell-local Prometheus uses our global Alertmanager to trigger alerts.

The cell-local Prometheus can use remote-write to send metrics to a Mimir tenant available outside of the Cell. Each cell can have its own Mimir tenant, deployed outside of the Cell's stack using the separate tooling we already have. Thanks to relabeling configs, we could limit the number of series we send to Mimir, as sketched below. This is in contrast to proposal 1, where all metrics are written to Mimir.
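
A minimal sketch of that relabeling, assuming (only as an example) that we forward nothing but recording-rule results, i.e. series whose names contain a `:`; the URL and tenant ID are placeholders:

```yaml
# Hypothetical cell-local Prometheus config for proposal 2:
# only a curated subset of series is forwarded to the global Mimir tenant.
remote_write:
  - url: https://mimir.global.example.internal/api/v1/push  # placeholder URL
    headers:
      X-Scope-OrgID: cell-01  # placeholder tenant ID
    write_relabel_configs:
      # Keep only recording-rule outputs (metric names containing ':'),
      # dropping the raw scraped series.
      - source_labels: [__name__]
        regex: '.+:.+'
        action: keep
```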

In this setup, we would be using Cell-local metrics during investigations and incident response. We'd only need to keep the detailed data scraped from services around for a limited amount of time (weeks), while we keep the data sent to Mimir around much longer.
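
The multi-tiered retention would then be a short local retention on the cell-local Prometheus plus a per-tenant retention override in Mimir; the flag and limit below are the standard ones, the values are only illustrative:

```yaml
# Illustrative retention split for proposal 2.
# Cell-local Prometheus keeps detailed data for a couple of weeks:
#   --storage.tsdb.retention.time=14d
# Mimir runtime overrides keep the forwarded subset much longer, per tenant:
overrides:
  cell-01:  # placeholder tenant ID
    compactor_blocks_retention_period: 400d
```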

Advantages

  1. Boring solution: it is straightforward to configure inside a cell.
  2. Cell's observability for incident response does not become reliant on global infrastructure.
  3. Off-the-shelf solution for multi-tiered retention.

Disadvantages

  1. Not all metrics are going to be globally available.
  2. There is only so much a Prometheus instance can handle. Prometheus itself can be vertically scaled, but after that we need to functionally shard it within a cell, having to break up recording rules and so on. This makes the setup more complex.

3. Cell local Prometheus-agent+Global Mimir

This setup is closer to how we currently operate the legacy GitLab.com infrastructure.

In each cell, we deploy one or more Prometheus-agents that scrape the application and write metrics to a Mimir-tenant.

The recording- and alerting-rule evaluation happens in the Mimir tenant, outside of the cell.
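
The rules themselves would then be standard Prometheus-format rule files per tenant, uploaded to the Mimir ruler (for example with `mimirtool rules load --id <tenant>`) rather than shipped as part of the tenant-observability-stack. A sketch, with all rule and metric names as placeholders:

```yaml
# Hypothetical rules file evaluated by the Mimir ruler for a cell's tenant
# (proposal 3); the cell itself only runs scraping agents.
groups:
  - name: cell_availability
    rules:
      - record: sli:gitlab_http_requests:rate5m
        expr: sum(rate(gitlab_http_requests_total[5m]))
      - alert: HighErrorRate
        expr: |
          sum(rate(gitlab_http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(gitlab_http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
```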

Advantages

  1. Easily horizontally scalable if a single cell grows.
  2. Relatively straightforward implementation.

Disadvantages

  1. New implementation in the tenant-observability-stack that is likely not going to be needed for any other kind of (smaller) tenant.
  2. The deployment of recording rules and alerts happens outside of the tenant-observability-stack, causing drift from how we implement things in GitLab Dedicated.
  3. Cell observability is no longer isolated from global infrastructure.

4. Cell local Mimir

This implementation would require us to deploy an entire Mimir installation inside a Cell. This includes all components for the read and the write path.

This would mean we can again evaluate rules locally in the cell and send those to our global alertmanager.

Global observability can be achieved by querying cross-tenant, as each cell would just be another tenant.
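
One possible shape for this, assuming the cell tenants are reachable behind a common query endpoint with Mimir's query tenant federation enabled, is a "global" Grafana datasource that pipe-separates tenant IDs in the `X-Scope-OrgID` header. A provisioning sketch with placeholder names and URLs:

```yaml
# Hypothetical Grafana datasource provisioning for cross-tenant (cross-cell) queries.
apiVersion: 1
datasources:
  - name: Mimir (all cells)
    type: prometheus
    url: https://mimir.global.example.internal/prometheus  # placeholder URL
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: cell-01|cell-02|cell-03  # placeholder tenant IDs
```

How the cell-local Mimir installations would be exposed behind such a shared query path is not decided here; the sketch only illustrates the query side.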

Advantages

  1. Horizontally scalable within a cell.
  2. Tenants are completely isolated and can alert independent of the global infrastructure.

Disadvantages

  1. Mimir is relatively complicated, with many components to deploy. This is likely overkill for a single cell.
  2. The setup isn't suited for any other kind of deployment, which makes our observability unit not very interesting for other kinds of deployments.