[BE] POC Contribution Analytics on ClickHouse
Context
Based on https://gitlab.com/gitlab-org/opstrace/opstrace/-/issues/2168+ we have the go-ahead to start experimenting with ClickHouse.
Contribution Analytics takes a very long time to load, and its data lives in a single table, which makes it a good candidate for a PoC of moving a feature to ClickHouse (CH).
Optimizing the retrieval of this data could also allow us to iterate on VSD by introducing contribution metrics.
Proposal
We can split the PoC into 3 steps:
Schema creation and data loading
Assuming that we'll have a CH cluster configured on PRD, we'll need to do the following:
- Define and create a database schema for the `events` table based on this guide: https://docs.gitlab.com/ee/development/database/clickhouse/gitlab_activity_data.html (a sketch of a possible schema follows this list).
- Create a change request issue to export and import the `events` table from PG to CH (some transformation might be required). This speeds up the data sync process since we'll only need to sync the rows between the import date and the current date.
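For illustration, here is a minimal sketch of what the CH `events` table could look like. The columns, engine, partitioning, and ordering key are assumptions loosely adapted from the linked guide and would need to be validated against the PG `events` schema during the PoC:

```ruby
# Hypothetical DDL for the CH `events` table, kept in a Ruby heredoc so it can be
# executed by whatever migration/connection mechanism we end up with.
EVENTS_TABLE_DDL = <<~SQL
  CREATE TABLE IF NOT EXISTS events
  (
    id UInt64,
    path String DEFAULT '',                        -- materialized group hierarchy, e.g. '9970/12345/'
    author_id UInt64,
    target_id UInt64,
    target_type LowCardinality(String) DEFAULT '',
    action UInt8,
    created_at DateTime64(6, 'UTC'),
    updated_at DateTime64(6, 'UTC')
  )
  ENGINE = ReplacingMergeTree(updated_at)
  PARTITION BY toYear(created_at)
  ORDER BY (path, created_at, id)
SQL
```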
Backend changes
Currently, the `Gitlab::ContributionAnalytics::DataCollector` class is responsible for providing data to the view layer (GraphQL). Essentially, we return a hash with counts which is then split/processed in Ruby. Let's split this class into two:
- `Gitlab::ContributionAnalytics::DataCollector` => interface, builds the DB-specific data collector implementation (based on a flag) and exposes `all_counts` (sketched below).
  - `Gitlab::ContributionAnalytics::PostgresqlDataCollector` => PG implementation
  - `Gitlab::ContributionAnalytics::ClickHouseDataCollector` => CH implementation
- `Gitlab::ContributionAnalytics::DataFormatter` => implements the data formatting methods such as `merge_requests_merged_by_author_count` and `totals`.
Note: A similar approach is described here for Elastic: #407347 (closed)
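As a minimal sketch (the `clickhouse_data_collection` flag name, constructor arguments, and hash shape are illustrative only), the top-level class could dispatch like this:

```ruby
module Gitlab
  module ContributionAnalytics
    class DataCollector
      def initialize(group:, from: nil, to: nil)
        @group = group
        @from = from
        @to = to
      end

      # Returns the raw counts hash from the selected backend.
      def all_counts
        collector_class =
          if Feature.enabled?(:clickhouse_data_collection, @group) # placeholder flag name
            ClickHouseDataCollector
          else
            PostgresqlDataCollector
          end

        collector_class.new(group: @group, from: @from, to: @to).all_counts
      end
    end
  end
end
```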
Example usage:

```ruby
Gitlab::ContributionAnalytics::DataFormatter.new(group: group).merge_requests_approved_by_author_count
```
The initializer of the `DataFormatter` accepts an optional argument to "inject" the collector:
```ruby
def initialize(group:, data_collector: Gitlab::ContributionAnalytics::DataCollector.new(group: group))
  @group = group
  @data_collector = data_collector
end

# later in tests
Gitlab::ContributionAnalytics::DataFormatter.new(group: group, data_collector: mocked_data_collector)
```
Note: we assume that a connection to CH will be available in the GitLab application, based on this blueprint: gitlab-com/www-gitlab-com#14379 (closed)
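Assuming such a connection exists, the CH collector could push the aggregation down to ClickHouse. A rough sketch, where the query shape and the `execute` call are placeholders rather than a settled design:

```ruby
module Gitlab
  module ContributionAnalytics
    class ClickHouseDataCollector
      def initialize(group:, from: nil, to: nil)
        @group = group
        @from = from
        @to = to
      end

      # Aggregates inside ClickHouse and returns a
      # { [author_id, target_type, action] => count } hash, mirroring the PG collector.
      def all_counts
        query = <<~SQL
          SELECT author_id, target_type, action, count() AS count
          FROM events
          WHERE startsWith(path, {group_path:String})
            AND created_at BETWEEN {from:DateTime} AND {to:DateTime}
          GROUP BY author_id, target_type, action
        SQL

        execute(query).each_with_object({}) do |row, hash|
          hash[[row['author_id'], row['target_type'], row['action']]] = row['count']
        end
      end

      private

      # Placeholder: execution depends on the CH client the application ends up exposing.
      def execute(_query)
        raise NotImplementedError
      end
    end
  end
end
```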
Keep data in sync
Implement a background job that periodically (every 5 minutes) sends event records to CH.
How it should work (a sketch of such a worker follows this list):
- Keep the most recently processed `events.id` value in Redis (cursor).
- If we manage to do a one-time import, take the latest `id` value and set it manually in Redis, so we won't process the whole table.
- Set up a background job that takes N rows (`id > cursor_value`) every 5 minutes and inserts them into CH.
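A minimal sketch of such a worker, assuming a Sidekiq job scheduled every 5 minutes; the Redis key, batch size, and CH insert mechanism are placeholders:

```ruby
# Hypothetical worker, scheduled every 5 minutes (e.g. via a cron entry).
class ClickHouseEventsSyncWorker
  include ApplicationWorker

  CURSOR_KEY = 'contribution_analytics:last_synced_event_id' # placeholder key name
  BATCH_SIZE = 10_000

  def perform
    cursor = Gitlab::Redis::SharedState.with { |redis| redis.get(CURSOR_KEY).to_i }

    events = Event.where('id > ?', cursor).order(:id).limit(BATCH_SIZE).to_a
    return if events.empty?

    insert_into_clickhouse(events)

    # Advance the cursor only after the insert succeeded.
    Gitlab::Redis::SharedState.with { |redis| redis.set(CURSOR_KEY, events.last.id) }
  end

  private

  # Placeholder: depends on the CH client available in the application.
  def insert_into_clickhouse(_events)
    raise NotImplementedError
  end
end
```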
Note: this job ensures that new data will land on CH. However, updates/deletions of existing data will not be synced. Once we have validated the PoC, we would need to implement a scheduled consistency worker, or look into an alternative approach to keeping data in sync.