[BE] POC Contribution Analytics on ClickHouse
Context
Based on https://gitlab.com/gitlab-org/opstrace/opstrace/-/issues/2168+ we have the go-ahead to start experimenting with ClickHouse.
Contribution Analytics takes a very long time to load, and its data lives in a single table, which makes it a good candidate for a PoC of moving a feature to ClickHouse (CH).
Optimizing the retrieval of this data could also allow us to iterate on VSD by introducing contribution metrics.
Proposal
We can split the PoC into 3 steps:
Schema creation and data loading
Assuming that we'll have a CH cluster configured on PRD, we'll need to do the following:
- Define and create a database schema for the `events` table based on this guide: https://docs.gitlab.com/ee/development/database/clickhouse/gitlab_activity_data.html (a sketch of a possible schema follows this list).
- Create a change request issue to export and import the `events` table from PG to CH (some transformation might be required). This speeds up the data sync process since we'll only need to sync the rows between the import date and the current date.
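For illustration, here is a minimal sketch of what the CH `events` table could look like. The columns, engine, partitioning, and ordering key are assumptions loosely adapted from the linked guide and would need to be validated against the PG `events` schema during the PoC:

```ruby
# Hypothetical DDL for the CH `events` table, kept in a Ruby heredoc so it can be
# executed by whatever migration/connection mechanism we end up with.
EVENTS_TABLE_DDL = <<~SQL
  CREATE TABLE IF NOT EXISTS events
  (
    id UInt64,
    path String DEFAULT '',                        -- materialized group hierarchy, e.g. '9970/12345/'
    author_id UInt64,
    target_id UInt64,
    target_type LowCardinality(String) DEFAULT '',
    action UInt8,
    created_at DateTime64(6, 'UTC'),
    updated_at DateTime64(6, 'UTC')
  )
  ENGINE = ReplacingMergeTree(updated_at)
  PARTITION BY toYear(created_at)
  ORDER BY (path, created_at, id)
SQL
```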
Backend changes
Currently, the `Gitlab::ContributionAnalytics::DataCollector` class is responsible for providing data to the view layer (GraphQL). Essentially, we return a hash with counts which is then split/processed in Ruby. Let's split this class into two:
- `Gitlab::ContributionAnalytics::DataCollector` => interface, builds the DB-specific data collector implementation (based on a flag) and exposes `all_counts` (sketched below).
  - `Gitlab::ContributionAnalytics::PostgresqlDataCollector` => PG implementation
  - `Gitlab::ContributionAnalytics::ClickHouseDataCollector` => CH implementation
- `Gitlab::ContributionAnalytics::DataFormatter` => implements the data formatting methods such as `merge_requests_merged_by_author_count` and `totals`.
Note: A similar approach is described here for Elastic: #407347 (closed)
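As a minimal sketch (the `clickhouse_data_collection` flag name, constructor arguments, and hash shape are illustrative only), the top-level class could dispatch like this:

```ruby
module Gitlab
  module ContributionAnalytics
    class DataCollector
      def initialize(group:, from: nil, to: nil)
        @group = group
        @from = from
        @to = to
      end

      # Returns the raw counts hash from the selected backend.
      def all_counts
        collector_class =
          if Feature.enabled?(:clickhouse_data_collection, @group) # placeholder flag name
            ClickHouseDataCollector
          else
            PostgresqlDataCollector
          end

        collector_class.new(group: @group, from: @from, to: @to).all_counts
      end
    end
  end
end
```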
Example usage:

```ruby
Gitlab::ContributionAnalytics::DataFormatter.new(group: group).merge_requests_approved_by_author_count
```
The initializer of the `DataFormatter` accepts an optional argument to "inject" the collector:
```ruby
def initialize(group:, data_collector: Gitlab::ContributionAnalytics::DataCollector.new(group: group))
  @group = group
  @data_collector = data_collector
end

# later in tests
Gitlab::ContributionAnalytics::DataFormatter.new(group: group, data_collector: mocked_data_collector)
```
Note: we assume that a connection to CH will be available in the GitLab application, based on this blueprint: gitlab-com/www-gitlab-com#14379 (closed)
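Assuming such a connection exists, the CH collector could push the aggregation down to ClickHouse. A rough sketch, where the query shape and the `execute` call are placeholders rather than a settled design:

```ruby
module Gitlab
  module ContributionAnalytics
    class ClickHouseDataCollector
      def initialize(group:, from: nil, to: nil)
        @group = group
        @from = from
        @to = to
      end

      # Aggregates inside ClickHouse and returns a
      # { [author_id, target_type, action] => count } hash, mirroring the PG collector.
      def all_counts
        query = <<~SQL
          SELECT author_id, target_type, action, count() AS count
          FROM events
          WHERE startsWith(path, {group_path:String})
            AND created_at BETWEEN {from:DateTime} AND {to:DateTime}
          GROUP BY author_id, target_type, action
        SQL

        execute(query).each_with_object({}) do |row, hash|
          hash[[row['author_id'], row['target_type'], row['action']]] = row['count']
        end
      end

      private

      # Placeholder: execution depends on the CH client the application ends up exposing.
      def execute(_query)
        raise NotImplementedError
      end
    end
  end
end
```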
Keep data in sync
Implement a background job that periodically (every 5 minutes) sends event records to CH.
How it should work (a sketch of such a worker follows this list):
- Keep the most recently processed `events.id` value in Redis (cursor).
- If we manage to do a one-time import, take the latest `id` value and set it manually in Redis, so we won't process the whole table.
- Set up a background job that takes N rows (`id > cursor_value`) every 5 minutes and inserts them into CH.
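A minimal sketch of such a worker, assuming a Sidekiq job scheduled every 5 minutes; the Redis key, batch size, and CH insert mechanism are placeholders:

```ruby
# Hypothetical worker, scheduled every 5 minutes (e.g. via a cron entry).
class ClickHouseEventsSyncWorker
  include ApplicationWorker

  CURSOR_KEY = 'contribution_analytics:last_synced_event_id' # placeholder key name
  BATCH_SIZE = 10_000

  def perform
    cursor = Gitlab::Redis::SharedState.with { |redis| redis.get(CURSOR_KEY).to_i }

    events = Event.where('id > ?', cursor).order(:id).limit(BATCH_SIZE).to_a
    return if events.empty?

    insert_into_clickhouse(events)

    # Advance the cursor only after the insert succeeded.
    Gitlab::Redis::SharedState.with { |redis| redis.set(CURSOR_KEY, events.last.id) }
  end

  private

  # Placeholder: depends on the CH client available in the application.
  def insert_into_clickhouse(_events)
    raise NotImplementedError
  end
end
```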
Note: this job ensures that new data will land on CH. However, updates/deletions of existing data will not be synced. Once we have validated the PoC, we would need to implement a scheduled consistency worker, or look into an alternative approach to keeping data in sync.