Service for rebuilding materialized view

Adam Hegyi requested to merge 431453-poc-for-rebuilding-ch-mvs into master

What does this MR do and why?

This MR adds a service for rebuilding a ClickHouse materialized view with fresh data. The worker and scheduling will be added in a follow-up.

Context

To get more context, please check the MV docs first: https://clickhouse.com/docs/en/guides/developer/cascading-materialized-views

In ClickHouse, materialized views are INSERT triggers. The Contribution Analytics feature implements fast reporting queries by using a materialized view on the events table: contributions_mv (the actual data is stored in the contributions table).

The events table changes over time as records are updated or deleted. These changes are not picked up by the materialized view, so the view becomes stale. The RebuildMaterializedViewService can rebuild a materialized view in batches without data loss.
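The staleness problem can be illustrated with a toy model in plain Ruby (this is not ClickHouse code, just an analogy for how an INSERT trigger behaves):

```ruby
# Toy illustration: a materialized view is an INSERT trigger, so it only
# ever sees inserts, never updates or deletes on the source table.
events = []        # the source table
contributions = [] # the table backing the materialized view

insert = ->(row) do
  events << row
  contributions << row # the "trigger" fires on insert only
end

insert.call({ id: 1, action: 5 })
insert.call({ id: 2, action: 6 })

# Deleting from the source does not touch the view's backing table:
events.reject! { |row| row[:id] == 2 }

events.size        # => 1
contributions.size # => 2, the view is now stale
```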

The service requires the following input data:

  • view_name: name of the materialized view (only ReplacingMergeTree views are supported at the moment).
  • tmp_view_name: name of the temporary view used while we backfill the data.
  • view_table_name: name of the table that backs the materialized view (each MV has its own table that holds data).
  • tmp_view_table_name: name of the temporary table which will receive the backfill data.
  • source_table_name: name of the source table where the MV gets the data.
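A hypothetical invocation might look like this; the exact service API is not shown here, and the tmp_* names are assumptions, but the keys mirror the list above:

```ruby
# Hypothetical parameters; the real service may accept these differently.
params = {
  view_name: 'contributions_mv',
  tmp_view_name: 'tmp_contributions_mv',    # assumed naming
  view_table_name: 'contributions',
  tmp_view_table_name: 'tmp_contributions', # assumed naming
  source_table_name: 'events'
}

# Something along the lines of:
# ClickHouse::RebuildMaterializedViewService.new(**params).execute
```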

Procedure:

  1. Create the tmp view and tmp table based on the existing table definitions (clone).
  2. At this point the tmp table will start receiving inserts.
  3. The materialized view definition contains the SELECT statement used for populating the records. This raw query can be taken from information_schema. Load the raw query from the DB and store it in a variable.
  4. Use the ClickHouse iterator to batch over the source table via the id column ("primary key").
  5. For each batch, build a query that takes the raw query from the MV definition and replaces its FROM clause with the batched subquery. This way, the query is limited to the given batch.
  6. Invoke an INSERT query into the tmp table which inserts data from the previously built query.
  7. When batching is done, swap the view table and the tmp table (atomic operation).
  8. Clean up the tmp table and tmp view.
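The procedure above can be sketched as SQL statements built from Ruby. Everything here is a simplified sketch: the table/view names are assumptions, and EXCHANGE TABLES stands in for the atomic swap (it is the atomic rename ClickHouse offers on the Atomic database engine):

```ruby
# Sketch of the rebuild flow; names and exact statements are assumptions.
def rebuild_statements(tmp_view:, table:, tmp_table:, raw_query:, batches:)
  statements = []

  # Steps 1-2: clone the backing table and the view; from this point on,
  # new inserts also flow into the tmp table via the tmp view.
  statements << "CREATE TABLE #{tmp_table} AS #{table}"
  statements << "CREATE MATERIALIZED VIEW #{tmp_view} TO #{tmp_table} AS #{raw_query}"

  # Steps 4-6: backfill batch by batch, restricting the MV query to the id range.
  batches.each do |(from_id, to_id)|
    batched_source = %((SELECT * FROM events WHERE id >= #{from_id} AND id <= #{to_id}) events)
    statements << "INSERT INTO #{tmp_table} #{raw_query.sub('FROM events', "FROM #{batched_source}")}"
  end

  # Steps 7-8: atomically swap the tables, then drop the leftovers.
  statements << "EXCHANGE TABLES #{table} AND #{tmp_table}"
  statements << "DROP VIEW #{tmp_view}"
  statements << "DROP TABLE #{tmp_table}"
  statements
end
```

Note that after the EXCHANGE, the tmp table holds the old (stale) data, so the final DROP TABLE removes the old copy, not the freshly backfilled one.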

Example batched query which will be used for the INSERT INTO statement:

SELECT id,
       argMax(path, events.updated_at) AS path,
       argMax(author_id, events.updated_at) AS author_id,
       argMax(target_type, events.updated_at) AS target_type,
       argMax(action, events.updated_at) AS action,
       argMax(DATE(created_at), events.updated_at) AS created_at,
       max(events.updated_at) AS updated_at
FROM
  -- this subquery comes from the batch iterator
  (SELECT *
   FROM "events"
   WHERE "events"."id" >= 72686301
     AND "events"."id" <= 82686301) "events"
WHERE ((events.action IN (5,
                          6))
       AND (events.target_type = ''))
  OR ((events.action IN (1,
                         3,
                         7,
                         12))
      AND (events.target_type IN ('MergeRequest',
                                  'Issue',
                                  'WorkItem')))
GROUP BY id

The job and service can be manually triggered on staging. To make it production-ready, we need the following (next MR):

  • Redis locking.
  • Error handling.
  • Stop and resume: rebuilding large MVs can take a long time.

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to test it

  1. Ensure that CH is set up and migrated for your GDK: https://docs.gitlab.com/ee/development/database/clickhouse/clickhouse_within_gitlab.html
  2. Enable the FF: Feature.enable(:event_sync_worker_for_click_house)
  3. If your GDK is seeded, you should have some events; invoke the sync job: ClickHouse::EventsSyncWorker.new.perform
  4. See if you have some data in the contributions materialized view table: ClickHouse::Client.select("select id from contributions final", :main)
  5. Delete some records by id (simulating an inconsistency): ClickHouse::Client.execute("delete from contributions where id > 160", :main)
  6. Rebuild the MV: ClickHouse::RebuildMaterializedViewCronWorker.new.perform
  7. Records should be restored: ClickHouse::Client.select("select id from contributions final", :main)

Related to #431453 (closed)
