[research] Group and organization optimized materialized views

Problem statement

Siphon will take care of replicating data from PostgreSQL to ClickHouse. The ClickHouse tables will look very similar to the PostgreSQL tables.

Essentially, these tables are not suitable for doing analytical queries on the project/group/organization levels because the primary key does not support such lookups (by project_id, namespace_id, organization_id). Example tables where this might be a problem: events, issues, merge_requests, epics (core domain objects)

The problem is described in more details in this presentation: https://docs.google.com/presentation/d/1ug8eHxnYpbv2IYWRUNh5e5KpOWK-Q_ylTfVd8NT1IVE/edit?usp=sharing

Idea

To support project/group/organization level queries we would need to build materialized views on top of the siphon's core domain object tables where we also bring in the namespaces.traversal_ids to the materialized view and make it part of the primary key.

A table structure example (events table + traversal ids/path) is described in this issue: #95 (closed)

Note: This structure has the downside of introducing data inconsistency when the group hierarchy changes.

Dealing with group hierarchy changes

  • Option 1: Detect hierarchy changes (group/project moved) and adjust the changes in the clickhouse MV tables. Change detection can be done by Siphon.
    • How to do a large rewrite/reinsert in a consistent manner?
    • Example 1: UPDATE events SET namespace_path = '4/5/6' WHERE namespace_path = '2/3/4'
    • Example 2 (should work for replacing merge tree): INSERT INTO events SELECT [columns here], '4/5/6' AS namespace_path WHERE namespace_path = '2/3/4'
  • Option 2: Periodically rebuild the materialized view using the recently added rebuild feature: https://clickhouse.com/docs/materialized-view/refreshable-materialized-view
    • How I/O intensive is this task?
    • How frequently can we run this?