Skip to content

[CA PoC] Revisit the database table schema and query

Why are we doing this work

To efficiently query the ClickHouse database, we need a database schema that supports the Contribution Analytics-related database queries (as of now, it seems like we'll need one query).

There is already a basic database schema used when running tests (#414588 (closed)). The task here is to finalize (experiment and measure) the CH-specifc DB schema (events) and DB query.

A rudimentary version of the DB query is available within the CH experiment documentation: https://docs.gitlab.com/ee/development/database/clickhouse/gitlab_activity_data.html#clickhouse-database-queries

What to focus on:

Contribution Analytics works on the group level including subgroups. The table schema should make it easy to query the group hierarchy. Having the project_id column is probably not enough since then we would have to load thousands of project_ids from PostgreSQL and pipe it into the ClickHouse query (project_id IN (1,2,3, ...).

Ideas:

  • Add traversal_ids or full_path to the CH table.
  • ???

As a result, we should have a ready-to-use database schema and DB query that we can use for Contribution Analytics.

Edited by Adam Hegyi