[CA PoC] [optional] Load PG `events` table to CH
Why are we doing this work
Optional
This issue might be optional since we can achieve similar results with #414937 (closed). We can skip this issue if we're OK with having data only from the last few weeks.
We need to do a one-time sync from the PostgreSQL database to have historical data for the Contribution Analytics feature.
Implementation plan
Depending on the database schema, there should be a one-time export from the PostgreSQL database to CSV, followed by an import into the ClickHouse database.
How:
- Reach out to the infrastructure team and ask for guidance.
- Prepare a SQL query that contains all data we need to be present in ClickHouse.
- If the exported data cannot be imported into the ClickHouse DB table (`events`) without some data transformation, load the raw data into a temporary table.
- Fill up the `events` database table on ClickHouse from the temporary table using `SELECT` + `INSERT` with the necessary data transformations (see the sketch after this list).
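If the temporary-table route is needed, a minimal sketch of that step could look like this. The `events_staging` table name, the column types, and the `toDateTime64` conversion are assumptions for illustration, not the final schema:

```sql
-- Hypothetical staging table that mirrors the raw CSV export from PostgreSQL.
-- Column names/types are illustrative; adjust them to the real ClickHouse schema.
CREATE TABLE events_staging
(
    id UInt64,
    path String,
    author_id UInt64,
    target_id UInt64,
    target_type String,
    action UInt8,
    created_at Float64, -- epoch seconds, as produced by the export query below
    updated_at Float64
)
ENGINE = MergeTree
ORDER BY id;

-- Fill the final `events` table from the staging table, applying the needed
-- transformations (here: converting epoch seconds into DateTime64 values).
INSERT INTO events
SELECT
    id,
    path,
    author_id,
    target_id,
    target_type,
    action,
    toDateTime64(created_at, 6, 'UTC'),
    toDateTime64(updated_at, 6, 'UTC')
FROM events_staging;
```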
SQL query:
\copy (
SELECT
events.id AS id,
(
CASE
WHEN project_id IS NOT NULL THEN (SELECT array_to_string(traversal_ids, '/') || '/' FROM namespaces WHERE id = (SELECT project_namespace_id FROM projects WHERE id = events.project_id LIMIT 1) LIMIT 1)
WHEN group_id IS NOT NULL THEN (SELECT array_to_string(traversal_ids, '/') || '/' FROM namespaces WHERE id = events.group_id)
ELSE ''
END
) AS path,
events.author_id AS author_id,
events.target_id AS target_id,
events.target_type AS target_type,
events.action AS action,
EXTRACT(epoch FROM created_at) AS created_at,
EXTRACT(epoch FROM updated_at) AS updated_at
FROM events
ORDER BY id
) TO events_with_path.csv WITH DELIMITER ',' CSV HEADER;
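Note that psql's `\copy` meta-command takes the entire remainder of the line as its arguments, so the multi-line form above can't be pasted into psql as-is. One way around this (a sketch, assuming the `SELECT` above is saved on its own in a hypothetical `events_with_path.sql` file and `$DATABASE_URL` points at the GitLab PostgreSQL database) is to run the equivalent server-side `COPY ... TO STDOUT` through `psql -c`:

```shell
# Sketch: wrap the SELECT (saved without the \copy wrapper) in COPY ... TO STDOUT
# and redirect psql's output into the CSV file. Connection details are placeholders.
psql "$DATABASE_URL" \
  -c "COPY ( $(cat events_with_path.sql) ) TO STDOUT WITH DELIMITER ',' CSV HEADER" \
  > events_with_path.csv
```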
The query creates a CSV file with the exact structure we need for the ClickHouse schema. Once the ClickHouse database is ready (the `events` table is present), we can import the CSV:
clickhouse-client -d gitlab_clickhouse_production --password -q 'INSERT INTO events FORMAT CSVWithNames' < events_with_path.csv
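As a quick sanity check after the import (a sketch; the counts should roughly match the PostgreSQL side, allowing for events created after the export):

```shell
# Verify the imported data: total row count plus the covered time range.
clickhouse-client -d gitlab_clickhouse_production --password \
  -q 'SELECT count(), min(created_at), max(created_at) FROM events'
```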