Audit Events Performance Limits
Introduction
This issue outlines a proposal for addressing the performance and scalability problems related to audit events in GitLab. It provides an overview of the problem, proposes the adoption of ClickHouse as a solution, and compares ClickHouse with PostgreSQL and Elasticsearch. It also outlines the benefits of the change and the use cases for audit events.
Problem Statement
The current compliance report query experiences performance issues when handling audit events, which has necessitated a 30-day limit on querying audit events. The growing volume of audit events, currently approximately 50 million events per month, worsens these performance issues. As a result, the system cannot store all audit events, such as Git actions, and experiences timeouts and performance issues in the AuditEventsController and related API endpoints.
Related issues:
- Optimize compliance violations query (#363357 - closed)
- In-operator optimization support for JOINs (#371164 - closed)
Use Cases for Audit Events:
When selecting a database for audit events, we must consider the following use cases to ensure optimal performance and functionality:
- High volume data storage: The database must be able to accommodate a large volume of audit events data without compromising performance.
- Support for billions of records: The database must be capable of handling billions of records, as the number of audit events is expected to increase over time.
- Capability to run complex queries for reports and analysis: The database should enable the execution of complex queries to generate insightful reports and analyses.
- Full-text search support: While not a critical requirement for audit events, the database should ideally offer full-text search capabilities for easier data retrieval and improved user experience.
- Immutability of audit events: Since audit events are immutable records that are never updated, the database should be optimized for this use case. ClickHouse is well-suited for this purpose, as noted in GitLab's documentation (https://docs.gitlab.com/ee/development/database/clickhouse/#how-it-differs-from-postgresql).
It is important to note that these use cases are particularly relevant for larger instances where the volume of audit events and the complexity of queries are more significant.
Given these use cases, we considered ClickHouse and Elasticsearch as database options for audit events. Below, we compare ClickHouse with Elasticsearch.
ClickHouse vs Elasticsearch
The following points compare ClickHouse and Elasticsearch for our audit events use cases.
- Storage efficiency: ClickHouse utilizes columnar storage and offers high compression rates, which can help reduce storage costs compared to Elasticsearch, particularly for large datasets.
- Analytical query performance: ClickHouse is specifically designed for analytical workloads and can handle complex queries on large datasets more efficiently than Elasticsearch.
- SQL support: ClickHouse uses a SQL query language, making it easier for users familiar with SQL to write complex queries and integrate with existing tools and applications. This will also depend on how we build the interface layer: !112815 (merged)
- Cost-effective: Elasticsearch indexes can consume a significant amount of storage space, leading to higher storage costs, especially when dealing with billions of rows. Furthermore, ClickHouse generally requires less RAM compared to Elasticsearch due to its columnar storage design, which can result in cost savings in terms of hardware requirements.
- Search: Although ClickHouse may not provide the same advanced full-text search features as Elasticsearch, its strengths in handling large volumes of data, analytical query performance, storage efficiency, and cost-effectiveness make it a suitable choice for our specific audit events use cases. Full-text search is not a critical requirement, and we can still optimize search in ClickHouse using sharding, partitioning, or materialized views (see the sketch after this list). Elasticsearch, on the other hand, can be very costly if we want to maintain indexes for billions of records.
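As an illustration of the materialized view option, here is a minimal sketch (assuming the experimental audit_events schema from the PoC section below; the view name audit_events_by_author is hypothetical): a second, differently sorted copy of the data lets common filters avoid full scans.

-- Sketch: keep a copy of the events sorted by author, so author-filtered
-- lookups read only the relevant range instead of scanning the table.
CREATE MATERIALIZED VIEW audit_events_by_author
ENGINE = MergeTree
ORDER BY (author_id, created_at)
POPULATE
AS SELECT * FROM audit_events;

-- Example lookup served efficiently by the view's sort order:
SELECT id, entity_type, entity_id, created_at
FROM audit_events_by_author
WHERE author_id = 42
ORDER BY created_at DESC
LIMIT 20;

POPULATE backfills existing rows; subsequent inserts into audit_events flow into the view automatically.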
Interesting read:
- https://pixeljets.com/blog/clickhouse-vs-elasticsearch/
- https://www.alibabacloud.com/blog/clickhouse-vs--elasticsearch_597898
Proposal:
To address these performance and scalability issues, we propose exploring the use of ClickHouse, a database system optimized for the aggregation and filtering of results. By implementing ClickHouse, we aim to achieve the following benefits:
- Reduce the disk space used by the main PostgreSQL database by removing the audit_events table entirely from it.
- Improve the performance of audit events APIs, allowing for faster retrieval of audit event data.
- Enable visualization and analytics of audit events, providing insights into system usage patterns and potential areas for improvement.
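As an illustration of the analytics use case, the aggregation below is the kind of query ClickHouse is optimized for. It is a sketch against the experimental schema defined in the PoC section, not a finished report query:

-- Monthly event volume per entity type, e.g. to feed a usage dashboard.
SELECT
    entity_type,
    toStartOfMonth(created_at) AS month,
    count() AS events
FROM audit_events
GROUP BY entity_type, month
ORDER BY month ASC, events DESC;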
Performance comparison: ClickHouse vs PostgreSQL
To analyze the performance gain, we can run the same query on PostgreSQL and ClickHouse using a local instance with the same set of data.
Query:
SELECT
    audit_events.*
FROM ((
    SELECT
        audit_events.*
    FROM
        audit_events
    WHERE
        audit_events.entity_type = 'Group'
        AND audit_events.entity_id = 25)
UNION ( -- UNION ALL for ClickHouse
    SELECT
        audit_events.*
    FROM
        audit_events
    WHERE
        audit_events.entity_type = 'Project'
        AND audit_events.entity_id IN (3, 2, 19, 6))) audit_events
For a small number of rows, performance is comparable, but as the number of rows grows, the read execution time of ClickHouse is much lower than that of PostgreSQL.
| Number of rows | PostgreSQL | ClickHouse |
|---|---|---|
| 1000 | 2.1 ms | 2 ms |
| 10000 | 5 ms | 7 ms |
| 100000 | 42 ms | 9 ms |
| 200000 | 80 ms | 10 ms |
| 550000 | 210 ms | 12 ms |
PoC:
Experimental ClickHouse schema:
DROP TABLE IF EXISTS audit_events;
CREATE TABLE audit_events (
    entity_id Int64,
    entity_path String,
    target_details String,
    author_id Int64,
    author_name String,
    entity_type String,
    details String,
    target_id Int64,
    ip_address IPv4,
    created_at DateTime,
    target_type String,
    id Int64
) ENGINE = MergeTree
ORDER BY (id);
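Note that ORDER BY (id) does not help the entity_type/entity_id filters used by the compliance report query above. A hypothetical production-oriented variant (not what was benchmarked here) could order the table by those columns and partition it by month, so that old events are cheap to drop:

-- Sketch only: sort key matches the compliance report filters,
-- monthly partitions allow cheap retention management.
CREATE TABLE audit_events_by_entity (
    entity_id Int64,
    entity_path String,
    target_details String,
    author_id Int64,
    author_name String,
    entity_type String,
    details String,
    target_id Int64,
    ip_address IPv4,
    created_at DateTime,
    target_type String,
    id Int64
) ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (entity_type, entity_id, created_at, id);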
Export data from PG:
\copy (select entity_id, coalesce(entity_path, ''), target_details, author_id, coalesce(author_name, ''), entity_type, details, target_id, coalesce(ip_address, '127.0.0.1'), extract(epoch from created_at)::bigint, target_type, id FROM audit_events) to 'audit_events.csv' with csv
Import to CH:
cat audit_events.csv | clickhouse-client -h 127.0.0.1 -q 'insert into audit_events format CSV'
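To sanity-check the import, the row count and the created_at range can be compared against the PostgreSQL source (a minimal example, run via clickhouse-client):

SELECT count() FROM audit_events;
SELECT min(created_at), max(created_at) FROM audit_events;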
Callouts
- Audit Events as Non-ActiveRecord: Audit events will no longer be backed by ActiveRecord and, as a result, cannot be joined in queries with other models. This change may require adjustments to the way audit events are queried and combined with other data.
- Smaller Instances Affected: Smaller instances might not require moving to ClickHouse due to their lower volume of data and simpler use cases. However, they will have to migrate anyway to maintain a single database solution, which could introduce additional complexity and resource requirements for those instances.
TODO
- Confirm with the other teams that rely on the audit_events table for their analyses and queries.
Conclusion:
By adopting ClickHouse as the database solution for audit events, we expect to achieve long-term scalability and improved performance in managing and querying audit events. The benefits of ClickHouse, such as its high-performance analytics capabilities and scalability, make it a suitable choice for addressing the current challenges faced with PostgreSQL. The adoption of ClickHouse is expected to enhance the overall efficiency of audit events reporting, providing valuable insights and a better user experience.