Prepare GitLab ClickHouse DB for Siphon
As part of the MVP work, we plan to enable these tables for Siphon:
- namespaces gitlab-org/gitlab!176809
- projects
- events
- issues
- merge_requests
- namespace_details
- bulk_import_entities
- milestones
- notes
To receive this data in SaaS ClickHouse, we'll need to prepare the database schema in the GitLab application repo.
For each table:
- Create a new ReplacingMergeTree table prefixed with siphon_. The prefix tells us that this table is populated from Siphon.
- For each column, inspect the data type matrix and apply the necessary transformation: https://gitlab.com/gitlab-org/architecture/gitlab-data-analytics/design-doc/-/blob/master/designs/logical_replication_mvp.md#supported-data-types (for example bigint -> UInt64).
- Create a migration file that creates this database table on ClickHouse.
- Add a test case that ensures schema integrity (column list should match). This is important for detecting schema changes.
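
The column-list comparison behind such a test could look roughly like the sketch below. It is written in Python only to illustrate the logic; the real test would live in the application's own test suite. The function name, the result keys, and the sample column lists are hypothetical, and the two siphon_ bookkeeping columns are excluded because they exist only on the ClickHouse side.

```python
# Bookkeeping columns that exist only on the ClickHouse side.
SIPHON_COLUMNS = {"siphon_replicated_at", "siphon_deleted"}


def schema_drift(pg_columns, ch_columns):
    """Return the column names present on one side but not the other."""
    pg = set(pg_columns)
    ch = set(ch_columns) - SIPHON_COLUMNS
    return {
        "missing_in_clickhouse": sorted(pg - ch),
        "unknown_in_clickhouse": sorted(ch - pg),
    }


# Hypothetical column lists; a real check would read them from both databases.
pg_milestones = ["id", "title", "due_date"]
ch_siphon_milestones = ["id", "title", "siphon_replicated_at", "siphon_deleted"]

drift = schema_drift(pg_milestones, ch_siphon_milestones)
```

With these sample lists, due_date is reported as missing in ClickHouse, which is the kind of schema change the test is meant to surface.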
Additionally, we need to add the following columns to each table (for the ReplacingMergeTree engine):
- siphon_replicated_at (datetime64)
- siphon_deleted (boolean)
See the research issue for a concrete example.
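
As a rough illustration of how the steps above fit together, the following Python sketch builds a CREATE TABLE statement for one table. It is not the actual migration. Only the bigint -> UInt64 entry comes from this issue; the other type mappings, the DateTime64 precision and time zone, the ORDER BY choice, and the function name are assumptions that would need to be checked against the data type matrix and the research issue.

```python
# Hypothetical PostgreSQL -> ClickHouse type map. Only the bigint entry comes
# from this issue; the rest are placeholders to verify against the type matrix.
PG_TO_CH = {
    "bigint": "UInt64",
    "text": "String",
    "boolean": "Bool",
    "timestamp with time zone": "DateTime64(6, 'UTC')",
}


def siphon_table_ddl(pg_table, pg_columns, primary_key):
    """Build ClickHouse DDL for the siphon_-prefixed copy of a PG table."""
    lines = []
    for name, pg_type in pg_columns:
        if pg_type not in PG_TO_CH:
            raise ValueError(f"no ClickHouse mapping for {pg_type!r} ({name})")
        lines.append(f"  {name} {PG_TO_CH[pg_type]}")
    # Extra columns required for the ReplacingMergeTree engine.
    # Precision and time zone are assumptions; the issue only says datetime64.
    lines.append("  siphon_replicated_at DateTime64(6, 'UTC')")
    lines.append("  siphon_deleted Bool")
    columns_sql = ",\n".join(lines)
    order_by = ", ".join(primary_key)  # assumed: sort by the PG primary key
    return (
        f"CREATE TABLE IF NOT EXISTS siphon_{pg_table}\n"
        f"(\n{columns_sql}\n)\n"
        "ENGINE = ReplacingMergeTree(siphon_replicated_at, siphon_deleted)\n"
        f"ORDER BY ({order_by})"
    )


# Hypothetical, shortened column list for illustration only.
ddl = siphon_table_ddl(
    "milestones",
    [("id", "bigint"), ("title", "text")],
    primary_key=["id"],
)
print(ddl)
```

Passing siphon_replicated_at as the version column and siphon_deleted as the deletion flag lets the engine keep the most recently replicated row per sorting key. The two-argument form of ReplacingMergeTree depends on the ClickHouse version in use, so that should be confirmed as well.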
- Note 1: on self-managed these tables will be empty for now.
- Note 2: it might make sense to write a generator for this, since most of the information can be derived from the PG table / AR model.
Edited by Felipe Cardozo