ClickHouse: Rebuild events and contributions MV
Current state
The current contribution analytics and AI metrics features implemented in ClickHouse are using a de-normalized events table.
List of columns:
CREATE TABLE gitlab_clickhouse_main_production.events
(
`id` UInt64 DEFAULT 0,
`path` String DEFAULT '0/',
`author_id` UInt64 DEFAULT 0,
`target_id` UInt64 DEFAULT 0,
`target_type` LowCardinality(String) DEFAULT '',
`action` UInt8 DEFAULT 0,
`deleted` UInt8 DEFAULT 0,
`created_at` DateTime64(6, 'UTC') DEFAULT now(),
`updated_at` DateTime64(6, 'UTC') DEFAULT now()
)
ENGINE = SharedReplacingMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}', updated_at, deleted)
PARTITION BY toYear(created_at)
PRIMARY KEY id
ORDER BY id
SETTINGS index_granularity = 8192
The table contains a special column called path which encodes the namespace+project ancestry into one database column. This allows us to query data on the group level more efficiently.
Events with Siphon
Siphon will be able to replicate tables from PostgreSQL to ClickHouse. De-normalizing data during the replication process will not be possible because table events are independent from each other. Each table populated by Siphon will be very similar to the PG table schema.
Siphon MVP will replicate the following three tables:
eventsprojectsnamespaces
Based on these tables, we can build Materialized Views to build the same schema like the current events table in ClickHouse.
How to generate the path value
This is how the events table looks like in PostgreSQL:
Table "public.events"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-----------------------+--------------------------+-----------+----------+------------------------------------+----------+-------------+--------------+-------------
project_id | bigint | | | | plain | | |
author_id | bigint | | not null | | plain | | |
created_at | timestamp with time zone | | not null | | plain | | |
updated_at | timestamp with time zone | | not null | | plain | | |
action | smallint | | not null | | plain | | |
target_type | character varying | | | | extended | | |
group_id | bigint | | | | plain | | |
fingerprint | bytea | | | | extended | | |
id | bigint | | not null | nextval('events_id_seq'::regclass) | plain | | |
target_id | bigint | | | | plain | | |
imported_from | smallint | | not null | 0 | plain | | |
personal_namespace_id | bigint | | | | plain | | |
- Events can belong to a Group via
group_id, for example an Epic is opened. In this caseproject_idwill be null. - Events can belong to a Project via
project_id, for example an MR is opened. In this casegroup_idwill be null.
If group_id is filled, this is how we build the path:
- Query the
namespacesrecord (1:1). - Take the
traversal_idsarray (example:[9970, 1234]). - Join the array with
/=>9970/1234/
If project_id is filled, this is how we build the path:
- Query the
projectsrecord (1:1). - Via the
projects.project_namespaceID, query thenamespacesrecord (1:1). - Take the
traversal_idsarray (example:[9970, 1234, 4321]). - Join the array with
/=>9970/1234/4321
Update:
To support organization-level queries, let's encode the organization id as the first element.
[1, 9970, 1234, 4321]
Org id is available in both projects and namespaces table so adding it to the CH path column should be straightforward. Note: the examples below needs to be adjusted.
Database schemas
Siphon will replicate data to the following ClickHouse tables.
Events:
CREATE TABLE IF NOT EXISTS siphon_events
(
project_id Nullable(Int64),
author_id Int64,
created_at DateTime64(6, 'UTC'),
updated_at DateTime64(6, 'UTC'),
action Int8,
target_type Nullable(String),
group_id Nullable(Int64),
fingerprint String,
id Int64,
target_id Nullable(Int64),
imported_from Int8 DEFAULT 0,
personal_namespace_id Nullable(Int64),
_siphon_replicated_at DateTime64(6, 'UTC') DEFAULT now(),
_siphon_deleted Boolean DEFAULT FALSE
)
ENGINE = ReplacingMergeTree(_siphon_replicated_at, _siphon_deleted)
PRIMARY KEY id
Namespaces:
CREATE TABLE IF NOT EXISTS siphon_namespaces
(
id Int64,
name String,
path String,
owner_id Nullable(Int64),
created_at Nullable(DateTime64(6, 'UTC')),
updated_at Nullable(DateTime64(6, 'UTC')),
type String DEFAULT 'User',
description String DEFAULT '',
avatar Nullable(String),
membership_lock Nullable(Boolean) DEFAULT false,
share_with_group_lock Nullable(Boolean) DEFAULT false,
visibility_level Int64 DEFAULT 20,
request_access_enabled Boolean DEFAULT true,
ldap_sync_status String DEFAULT 'ready',
ldap_sync_error Nullable(String),
ldap_sync_last_update_at Nullable(DateTime64(6, 'UTC')),
ldap_sync_last_successful_update_at Nullable(DateTime64(6, 'UTC')),
ldap_sync_last_sync_at Nullable(DateTime64(6, 'UTC')),
description_html Nullable(String),
lfs_enabled Nullable(Boolean),
parent_id Nullable(Int64),
shared_runners_minutes_limit Nullable(Int64),
repository_size_limit Nullable(Int64),
require_two_factor_authentication Boolean DEFAULT false,
two_factor_grace_period Int64 DEFAULT 48,
cached_markdown_version Nullable(Int64),
project_creation_level Nullable(Int64),
runners_token Nullable(String),
file_template_project_id Nullable(Int64),
saml_discovery_token Nullable(String),
runners_token_encrypted Nullable(String),
custom_project_templates_group_id Nullable(Int64),
auto_devops_enabled Nullable(Boolean),
extra_shared_runners_minutes_limit Nullable(Int64),
last_ci_minutes_notification_at Nullable(DateTime64(6, 'UTC')),
last_ci_minutes_usage_notification_level Nullable(Int64),
subgroup_creation_level Nullable(Int64) DEFAULT 1,
emails_disabled Nullable(Boolean),
max_pages_size Nullable(Int64),
max_artifacts_size Nullable(Int64),
mentions_disabled Nullable(Boolean),
default_branch_protection Nullable(Int8),
unlock_membership_to_ldap Nullable(Boolean),
max_personal_access_token_lifetime Nullable(Int64),
push_rule_id Nullable(Int64),
shared_runners_enabled Boolean DEFAULT true,
allow_descendants_override_disabled_shared_runners Boolean DEFAULT false,
traversal_ids INSERT_CH_TYPE,
organization_id Nullable(Int64),
_siphon_replicated_at DateTime64(6, 'UTC') DEFAULT now(),
_siphon_deleted Boolean DEFAULT FALSE
)
ENGINE = ReplacingMergeTree(_siphon_replicated_at, _siphon_deleted)
PRIMARY KEY id
Projects:
CREATE TABLE IF NOT EXISTS siphon_projects
(
id Int64,
name Nullable(String),
path Nullable(String),
description Nullable(String),
created_at Nullable(DateTime64(6, 'UTC')),
updated_at Nullable(DateTime64(6, 'UTC')),
creator_id Nullable(Int64),
namespace_id Int64,
last_activity_at Nullable(DateTime64(6, 'UTC')),
import_url Nullable(String),
visibility_level Int64 DEFAULT 0,
archived Boolean DEFAULT false,
avatar Nullable(String),
merge_requests_template Nullable(String),
star_count Int64 DEFAULT 0,
merge_requests_rebase_enabled Nullable(Boolean) DEFAULT false,
import_type Nullable(String),
import_source Nullable(String),
approvals_before_merge Int64 DEFAULT 0,
reset_approvals_on_push Nullable(Boolean) DEFAULT true,
merge_requests_ff_only_enabled Nullable(Boolean) DEFAULT false,
issues_template Nullable(String),
mirror Boolean DEFAULT false,
mirror_last_update_at Nullable(DateTime64(6, 'UTC')),
mirror_last_successful_update_at Nullable(DateTime64(6, 'UTC')),
mirror_user_id Nullable(Int64),
shared_runners_enabled Boolean DEFAULT true,
runners_token Nullable(String),
build_allow_git_fetch Boolean DEFAULT true,
build_timeout Int64 DEFAULT 3600,
mirror_trigger_builds Boolean DEFAULT false,
pending_delete Nullable(Boolean) DEFAULT false,
public_builds Boolean DEFAULT true,
last_repository_check_failed Nullable(Boolean),
last_repository_check_at Nullable(DateTime64(6, 'UTC')),
only_allow_merge_if_pipeline_succeeds Boolean DEFAULT false,
has_external_issue_tracker Nullable(Boolean),
repository_storage String DEFAULT 'default',
repository_read_only Nullable(Boolean),
request_access_enabled Boolean DEFAULT true,
has_external_wiki Nullable(Boolean),
ci_config_path Nullable(String),
lfs_enabled Nullable(Boolean),
description_html Nullable(String),
only_allow_merge_if_all_discussions_are_resolved Nullable(Boolean),
repository_size_limit Nullable(Int64),
printing_merge_request_link_enabled Boolean DEFAULT true,
auto_cancel_pending_pipelines Int64 DEFAULT 1,
service_desk_enabled Nullable(Boolean) DEFAULT true,
cached_markdown_version Nullable(Int64),
delete_error Nullable(String),
last_repository_updated_at Nullable(DateTime64(6, 'UTC')),
disable_overriding_approvers_per_merge_request Nullable(Boolean),
storage_version Nullable(Int8),
resolve_outdated_diff_discussions Nullable(Boolean),
remote_mirror_available_overridden Nullable(Boolean),
only_mirror_protected_branches Nullable(Boolean),
pull_mirror_available_overridden Nullable(Boolean),
jobs_cache_index Nullable(Int64),
external_authorization_classification_label Nullable(String),
mirror_overwrites_diverged_branches Nullable(Boolean),
pages_https_only Nullable(Boolean) DEFAULT true,
external_webhook_token Nullable(String),
packages_enabled Nullable(Boolean),
merge_requests_author_approval Nullable(Boolean) DEFAULT false,
pool_repository_id Nullable(Int64),
runners_token_encrypted Nullable(String),
bfg_object_map Nullable(String),
detected_repository_languages Nullable(Boolean),
merge_requests_disable_committers_approval Nullable(Boolean),
require_password_to_approve Nullable(Boolean),
emails_disabled Nullable(Boolean),
max_pages_size Nullable(Int64),
max_artifacts_size Nullable(Int64),
pull_mirror_branch_prefix Nullable(String),
remove_source_branch_after_merge Nullable(Boolean),
marked_for_deletion_at Nullable(Date32),
marked_for_deletion_by_user_id Nullable(Int64),
autoclose_referenced_issues Nullable(Boolean),
suggestion_commit_message Nullable(String),
project_namespace_id Nullable(Int64),
hidden Boolean DEFAULT false,
organization_id Nullable(Int64),
_siphon_replicated_at DateTime64(6, 'UTC') DEFAULT now(),
_siphon_deleted Boolean DEFAULT FALSE
)
ENGINE = ReplacingMergeTree(_siphon_replicated_at, _siphon_deleted)
PRIMARY KEY id
Materialized views
Store generated path for namespaces:
CREATE TABLE namespace_paths (
id Int64 DEFAULT 0,
path String DEFAULT '0/',
version DateTime64(6, 'UTC') DEFAULT NOW(),
deleted Boolean DEFAULT false
)
ENGINE=ReplacingMergeTree(version, deleted)
PRIMARY KEY id
SETTINGS index_granularity = 512; -- lower granularity so id lookups use less I/O
CREATE MATERIALIZED VIEW namespace_paths_mv
TO namespace_paths
AS
SELECT
id,
if(length(traversal_ids) = 0, '0/', arrayStringConcat(traversal_ids, '/') || '/') as path,
_siphon_replicated_at AS version,
_siphon_deleted AS deleted
FROM siphon_namespaces;
With this materialized view we can look up the path for a given group id quickly:
SELECT path
FROM namespace_paths_mv
WHERE id = 9970
Store the generated path for a given project:
CREATE TABLE project_namespace_paths (
id Int64 DEFAULT 0,
path String DEFAULT '0/',
version DateTime64(6, 'UTC') DEFAULT NOW(),
deleted Boolean DEFAULT false
)
ENGINE=ReplacingMergeTree(version, deleted)
PRIMARY KEY id
SETTINGS index_granularity = 512; -- lower granularity so id lookups use less I/O
The CTE trick is to make the joins more efficient. Otherwise it would load all namespace paths records.
CREATE MATERIALIZED VIEW project_namespace_paths_mv
TO project_namespace_paths
AS
WITH cte AS (
SELECT id, project_namespace_id FROM siphon_projects
), namespaces_cte AS (
SELECT path, id
FROM namespace_paths
WHERE id IN (SELECT project_namespace_id FROM cte)
)
SELECT cte.id, namespaces_cte.path
FROM cte
INNER JOIN namespaces_cte ON namespaces_cte.id = cte.project_namespace_id
1000 rows in set. Elapsed: 0.006 sec. Processed 9.12 thousand rows, 137.56 KB (1.54 million rows/s., 23.25 MB/s.)
Peak memory usage: 910.67 KiB.
Compare it with a simple inner join query:
SELECT siphon_projects.id, namespace_paths.path
FROM siphon_projects
INNER JOIN namespace_paths on namespace_paths.id=siphon_projects.project_namespace_id
limit 1000;
1000 rows in set. Elapsed: 3.865 sec. Processed 66.78 million rows, 2.21 GB (17.28 million rows/s., 571.12 MB/s.)
Peak memory usage: 6.79 GiB.
Rebuild the events table:
CREATE TABLE rebuilt_events (
id Int64 DEFAULT 0,
path String DEFAULT '0/',
action Int8 DEFAULT 0,
target_type LowCardinality(String) DEFAULT '',
target_id Int64 DEFAULT 0,
version DateTime64(6, 'UTC') DEFAULT NOW(),
deleted Boolean DEFAULT false
)
ENGINE=ReplacingMergeTree(version, deleted)
PRIMARY KEY id
MV defintion:
CREATE MATERIALIZED VIEW rebuilt_events_mv
TO rebuilt_events
AS
WITH cte AS (
SELECT * FROM siphon_events
), group_lookups AS (
SELECT id, path
FROM namespace_paths_mv
WHERE id IN (SELECT DISTINCT group_id FROM cte)
), project_lookups AS (
SELECT id, path
FROM project_namespace_paths_mv
WHERE id IN (SELECT DISTINCT project_id FROM cte)
)
SELECT
cte.id AS id,
CASE
WHEN cte.project_id != 0 THEN project_lookups.path
WHEN cte.group_id != 0 THEN group_lookups.path
ELSE '0/'
END AS path,
cte.action AS action,
cte.target_type AS target_type,
cte.target_id AS target_id,
cte._siphon_replicated_at AS version,
cte._siphon_deleted AS deleted
FROM cte
LEFT JOIN group_lookups ON group_lookups.id=cte.group_id
LEFT JOIN project_lookups on project_lookups.id=cte.project_id
Conclusion
With Materialized Views, we can rebuild exactly the same events table and we can also keep the table up to date via the version and deleted flags.
Caveats:
Group hierarchy changes (project is moved) will not going to be reflected in the rebuilt_events table. In ClickHouse, changes on JOIN-ed tables in the MV query are not going to trigger change events. The problem exists even today: we run periodical workers to ensure consistency.
Ideas:
- Periodically rebuild materialized views. (we already have functionality for this).
- Extend the ClickHouse consumer to trigger refresh on some records in the MV when certain conditions are met.