Alternative to pg_dump/pg_restore with anonymization and parallelization (#274) · Issues · PostgresAI / DBLab Engine

Alternative to pg_dump/pg_restore with anonymization and parallelization

Goal === Develop an alternative to standard `pg_dump` and `pg_restore` tools. Requirements: - must work on the fly – without the need to have an intermediate file on disk - parallelization (see TODO section) - fast, based on `COPY` (advanced goal: have parallelized dump/restore of a single table based on COPY) - anonymization (see TODO section) Performance requirement: we saw that for our Demo database, which is ~1 TiB in size, single-threaded dump/restore took ~5h (according to @fomin.list) -- the new tool should be much, must faster despite possible use of anonymization. We have to be able to dump/restore 10-30 TiB RDS databases on daily basis in insane time, using high parallelized processes (many vCPUs involved both on source and target). TODO / How to implement === #### Parallelization - first, we dump/re-create schema, without indexes and materialized views' contents – this can be done in a single thread since it's fast - next, we dump and load data. `COPY` has to be used. Advanced goal: have parallelized dump/restore for a single table too. "Regular" goal: parallelize load of multiple tables, based on parallelization level specified - finally, create indexes and initially-refresh materialized views – in N threads, based on parallelization level specified #### Anonymization (support based on two approaches (both supporting GitOps but in slightly different ways): - SQL comments for columns (maybe for tables too) – for example, if a column's comment contains the substring `~~PII~~`, it has to be anonymized (extension: if it's `~~ANON(hash)~~`, then a hash function is applied to anonymize -- see examples of possible functions in https://postgresql-anonymizer.readthedocs.io/en/stable/masking_functions.html). It is supposed that all PII columns are marked with such comments -- and changes are controlled by responsible persons via GitOps so it's impossible to get PII unanomizied. - annotation files - based on some directory in some Git repo, understand how to extract data, perhaps taking not all the rows – depending on the request (example: `copy (select id as id, email as anon_func_rnd(email) as email from tbl1 where ...) to stdout csv delimiter ',';` also see https://www.2ndquadrant.com/en/blog/tablesample-in-postgresql-9-5-2/). It is supposed that all PII columns are marked in such files -- and changes are controlled by responsible persons via GitOps so it's impossible to get PII unanomizied. Questions to solve: - what and how applies anonymization functions? Postgres on the source or Go code? Think: is it acceptable to transfer unanomized data and change it only before saving it to the target - can we offer something that would simplify the initial setup for large existing databases? Acceptance criteria === As a DLE's admin, I can set up a logical dump/restore process with anonymization (controlled via GitOps), fast (parallelized) enough to work with ~10 TiB database.

issue