Proposal: Indexing metrics labels for efficiently deduplicating & querying time series data
Status: DEFERRED
Current Status (as of October 2023)
Following the discussion around this proposal, we decided to defer its implementation to a later iteration, if and when the need arises. For our current implementation, we intend to:
- ingest all incoming data as is without deduplicating it.
- use dedicated tables/schemas per metric-type, i.e. sums, gauges, histograms, etc.
- query data directly from ClickHouse & leverage its inbuilt functionality to aggregate and/or downsample data.
Problem Statement
Within the scope of building our metrics subsystem, two important design tenets are (1) how efficiently we can ingest incoming metrics data and (2) how efficiently our users can query it thereafter. To that end, designing the underlying system(s) to keep both our ingestion and consumption pipelines as efficient as possible translates into the following three goals:
- Being able to ingest large volumes of data with low latency and high throughput.
- Being able to persist data in a way that minimises our storage costs.
- Being able to query data quickly and efficiently in terms of compute & memory requirements.
Note that, while the following proposal focuses on implementation details for the metrics subsystem, some of the work here applies equally well to our tracing & logging subsystems in principle, especially indexing incoming data and/or the metadata around it.
Proposal
Build & maintain a dynamically updated index of metric labels attached to incoming time series data to be able to:
- Deduplicate data before storing it in our data stores to reduce our storage footprint.
- Given a set of labels/attributes, query which time series they correspond to.
In particular, this proposal aims at optimising our system design for the aforementioned goals 2 & 3, i.e. keeping our storage footprint as low as possible and querying all ingested data as efficiently as possible.
The following diagram describes how we intend to ingest data within the metrics subsystem with references to the other components involved in the pipeline, especially a Redis-based indexing component.
Overview
Per this proposal, we intend to model & store all incoming time series data in a storage-optimised manner by deduplicating data and eliminating any redundancy in our storage schemas. This can be achieved by building and maintaining, dynamically during data ingestion, one or more indexes of ingested metric labels and/or metric fingerprints, and then using them to deduplicate the redundant parts of incoming data.
A metric-fingerprint is a deterministic hash of a metric name alongside all attached labels & their values. In Prometheus, it is computed as an FNV-1 hash of all the aforementioned strings.
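To make the idea concrete, here is a minimal sketch of such a fingerprint in Go. It uses the FNV-1a variant from the standard `hash/fnv` package and a `0xff` separator byte; the exact hash variant and separators are implementation details to be fixed during build-out, not decisions made by this proposal.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint computes a deterministic hash over the metric name and all
// label key/value pairs. Labels are sorted by key first so that the same
// label set always yields the same fingerprint, irrespective of input order.
func fingerprint(name string, labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	h.Write([]byte(name))
	for _, k := range keys {
		h.Write([]byte{0xff}) // separator to avoid ambiguous concatenations
		h.Write([]byte(k))
		h.Write([]byte{0xff})
		h.Write([]byte(labels[k]))
	}
	return h.Sum64()
}

func main() {
	fp := fingerprint("my_awesome_metric", map[string]string{
		"node": "opstrace-101", "dc": "ams", "region": "us-east-1",
	})
	fmt.Printf("%x\n", fp)
}
```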
As described in this document, the ability to index metric labels attached to incoming time series data helps segregate it into series metadata and points data. This ensures we only ever store a single copy of any metric-specific metadata, reference it via a `series_id`, and store metric data values against this identifier, thus greatly reducing redundancy within our storage schemas. To illustrate this idea, incoming data such as this:
my_awesome_metric{node: opstrace-101, dc: ams, region: us-east-1} 1695242575 1
my_awesome_metric{node: opstrace-101, dc: ams, region: us-east-1} 1695242576 2
my_awesome_metric{node: opstrace-101, dc: ams, region: us-east-1} 1695242589 10
can ideally be stored efficiently as two separate datasets:
- metrics metadata
fingerprint | tags | series_id |
---|---|---|
63e67dff | {name: my_awesome_metric, node: opstrace-101, dc: ams, region: us-east-1} | 123 |
- points data (values stored as floats internally to maintain precision)
series_id | timestamp | value |
---|---|---|
123 | 1695242575 | 1.0 |
123 | 1695242576 | 2.0 |
123 | 1695242589 | 10.0 |
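As a rough illustration of how these two datasets might be represented inside the ingester, the following Go sketch defines one record shape per table; the type and field names are placeholders rather than a finalised schema.

```go
package metricsindex

// SeriesMetadata and Point are illustrative in-memory shapes for the two
// datasets above; they are not a finalised storage schema.
type SeriesMetadata struct {
	Fingerprint uint64            // deterministic hash of metric name + labels
	Tags        map[string]string // metric name and all label key/value pairs
	SeriesID    uint64            // compact identifier referenced by the points
}

type Point struct {
	SeriesID  uint64  // reference into the metadata dataset
	Timestamp int64   // sample timestamp (unix seconds in the example above)
	Value     float64 // sample value, stored as a float
}
```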
Architecture
From an architecture perspective, the following points highlight how we structure, build & maintain these indexes:
Persistent backing state necessary to (re)build indexes
All of our indexes are dynamically updated and kept in sync with a persistent backing state at all times. Any index updates are done using write-through semantics, with changes reflected in our ClickHouse database using transactions. The same state is also consumed on process startup to rebuild the index prior to servicing any reads or writes.
fingerprint | tags | series_id |
---|---|---|
63e67dff | {node: opstrace-101, dc: ams, region: us-east-1} | 1 |
2103cfc1 | {node: opstrace-102, dc: ams, region: us-east-1} | 2 |
5d0cc4ee | {node: opstrace-103, dc: lhr, region: us-east-1} | 3 |
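A minimal sketch of the startup path, assuming the `go-redis` client (`github.com/redis/go-redis/v9`) and a `state` map standing in for rows read back from the ClickHouse-backed table; key names such as `idx:fingerprint` are illustrative only.

```go
package metricsindex

import (
	"context"
	"strconv"

	"github.com/redis/go-redis/v9"
)

// rebuildFingerprintIndex replays the persistent backing state into Redis on
// startup, before the ingester serves any reads or writes. Only the
// fingerprint -> series_id index is rebuilt in this sketch.
func rebuildFingerprintIndex(ctx context.Context, rdb *redis.Client, state map[uint64]uint64) error {
	for fp, seriesID := range state {
		field := strconv.FormatUint(fp, 16)
		if err := rdb.HSet(ctx, "idx:fingerprint", field, seriesID).Err(); err != nil {
			return err
		}
	}
	return nil
}
```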
Building multiple indexes as necessary
We intend to support building & maintaining multiple indexes to serve whichever use cases arise on the read or write paths. Some of these are described below:
Metric fingerprint to corresponding `series_id`
The following index can be used to quickly look up whether an ingested metric has been seen previously and, if so, get its corresponding `series_id` directly. This helps rapidly deduplicate incoming data and allows our storage schemas to store series metadata and actual points data separately.
fingerprint | series_id |
---|---|
63e67dff | 1 |
2103cfc1 | 2 |
5d0cc4ee | 3 |
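On the write path, a lookup against this index could look roughly like the sketch below, again assuming `go-redis`; the Redis counter used to allocate new `series_id`(s) is purely illustrative.

```go
package metricsindex

import (
	"context"
	"strconv"

	"github.com/redis/go-redis/v9"
)

// lookupOrAssignSeriesID consults the fingerprint index first; only when a
// fingerprint has not been seen before do we allocate a new series_id (via a
// Redis counter here, purely for illustration) and write the mapping through.
func lookupOrAssignSeriesID(ctx context.Context, rdb *redis.Client, fp uint64) (uint64, error) {
	field := strconv.FormatUint(fp, 16)

	id, err := rdb.HGet(ctx, "idx:fingerprint", field).Uint64()
	if err == nil {
		return id, nil // seen before: reuse the existing series_id
	}
	if err != redis.Nil {
		return 0, err
	}

	// Not indexed yet: allocate a fresh series_id and record it. Two ingesters
	// racing here may briefly create duplicate ids for the same series, which
	// the proposal explicitly tolerates (see "Index storage" below).
	next, err := rdb.Incr(ctx, "idx:series_id_seq").Result()
	if err != nil {
		return 0, err
	}
	if err := rdb.HSet(ctx, "idx:fingerprint", field, next).Err(); err != nil {
		return 0, err
	}
	return uint64(next), nil
}
```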
Metric labels to `series_id`(s)
The following is an inverted index that lets us efficiently query which series correspond to a given set of labels. During a query/lookup, we first get the `series_id`(s) corresponding to one or more `<tag key, tag value>` pairs or definitions, then compute an intersection across them to arrive at the final list of matching `series_id`(s). From there, given a start & end timestamp, building the query response is straightforward.
tag key | tag value | []series_id |
---|---|---|
node | opstrace-101 | {1} |
node | opstrace-102 | {2} |
node | opstrace-103 | {3} |
dc | ams | {1, 2} |
dc | lhr | {3} |
region | us-east-1 | {1,2,3} |
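A sketch of both sides of this index with `go-redis`, using an illustrative `idx:tag:<key>:<value>` key scheme: `SADD` populates the per-label sets on the write path and `SINTER` intersects them on the read path.

```go
package metricsindex

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// indexSeries adds a series_id to the per-<tag key, tag value> sets on the
// write path.
func indexSeries(ctx context.Context, rdb *redis.Client, seriesID uint64, tags map[string]string) error {
	for k, v := range tags {
		if err := rdb.SAdd(ctx, fmt.Sprintf("idx:tag:%s:%s", k, v), seriesID).Err(); err != nil {
			return err
		}
	}
	return nil
}

// querySeriesIDs resolves a label selector to the matching series_ids by
// intersecting the per-label sets.
func querySeriesIDs(ctx context.Context, rdb *redis.Client, selector map[string]string) ([]string, error) {
	keys := make([]string, 0, len(selector))
	for k, v := range selector {
		keys = append(keys, fmt.Sprintf("idx:tag:%s:%s", k, v))
	}
	// SINTER computes the intersection across all the sets in a single call,
	// e.g. {dc: ams, region: us-east-1} -> {1, 2} for the table above.
	return rdb.SInter(ctx, keys...).Result()
}
```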
Index storage
- The proposal intends to use a cluster-local Redis instance as the indexing store, so that multiple instances of the ingester process (OTEL collector) can use any indexed data. This keeps performance consistent across all ingesters regardless of where incoming data is routed.
- Redis also provides inbuilt commands such as `HSET` to update multiple key-value pairs under a given hash key, `SADD` to add `series_id`(s) to a set, and `SINTER` to compute intersections across multiple sets, all with acceptable underlying complexities. This makes modelling & implementing the needed indexes trivial without writing a lot of code ourselves.
- Subject to how quickly index updates get processed, we may still see the same time series being referenced by multiple `series_id`(s); regardless, the system remains logically correct, albeit at the expense of a small amount of redundancy.
(rejected alternatives)
- An in-memory index might be easier to implement here, but it can only be (trivially) consumed by the process it lives in, which makes it less suitable for our use case when we run multiple replicas of our collector for a given pipeline.
- We also studied the feasibility of using ClickHouse materialized views to index these data, but that requires data to flow back & forth between the ingester(s) & the database very frequently. With our intention of using a cloud database instance external to the cluster, this becomes all the more expensive.
Future improvements
- For efficiently computing intersections across sets of `series_id`(s), our indexed dataset can be better structured as bitmaps (a minimal sketch follows this list).
- For scanning large numbers of tag keys & values against a given query, we can utilise search techniques such as building FSTs (finite state transducers) and using them to improve our lookups across indexed data.
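As an illustration of the bitmap idea, the sketch below uses the `github.com/RoaringBitmap/roaring` library (an assumed choice, not a decision of this proposal) to intersect per-label bitmaps of `series_id`(s):

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring" // assumed library choice
)

func main() {
	// One bitmap per <tag key, tag value> pair, holding the matching series_ids.
	dcAms := roaring.New()
	dcAms.AddMany([]uint32{1, 2})

	regionUsEast1 := roaring.New()
	regionUsEast1.AddMany([]uint32{1, 2, 3})

	// The bitmap intersection plays the role of SINTER:
	// series matching {dc: ams, region: us-east-1}.
	matches := roaring.And(dcAms, regionUsEast1)
	fmt.Println(matches.ToArray()) // [1 2]
}
```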
Advantages
- Deduplication and segregation of incoming data into series metadata and point data.
- Reduction in total cost of operation for the system.
- Improved query performance on the read path.
Disadvantages
- Additional overhead of building & maintaining indexes.