Sign in or sign up before continuing. Don't have an account yet? Register now to get started.
Alternative to pg_dump/pg_restore with anonymization and parallelization
Goal
===
Develop an alternative to standard `pg_dump` and `pg_restore` tools. Requirements:
- must work on the fly – without the need to have an intermediate file on disk
- parallelization (see TODO section)
- fast, based on `COPY` (advanced goal: have parallelized dump/restore of a single table based on COPY)
- anonymization (see TODO section)
Performance requirement: we saw that for our Demo database, which is ~1 TiB in size, single-threaded dump/restore took ~5h (according to @fomin.list) -- the new tool should be much, must faster despite possible use of anonymization. We have to be able to dump/restore 10-30 TiB RDS databases on daily basis in insane time, using high parallelized processes (many vCPUs involved both on source and target).
TODO / How to implement
===
#### Parallelization
- first, we dump/re-create schema, without indexes and materialized views' contents – this can be done in a single thread since it's fast
- next, we dump and load data. `COPY` has to be used. Advanced goal: have parallelized dump/restore for a single table too. "Regular" goal: parallelize load of multiple tables, based on parallelization level specified
- finally, create indexes and initially-refresh materialized views – in N threads, based on parallelization level specified
#### Anonymization (support based on two approaches (both supporting GitOps but in slightly different ways):
- SQL comments for columns (maybe for tables too) – for example, if a column's comment contains the substring `~~PII~~`, it has to be anonymized (extension: if it's `~~ANON(hash)~~`, then a hash function is applied to anonymize -- see examples of possible functions in https://postgresql-anonymizer.readthedocs.io/en/stable/masking_functions.html). It is supposed that all PII columns are marked with such comments -- and changes are controlled by responsible persons via GitOps so it's impossible to get PII unanomizied.
- annotation files - based on some directory in some Git repo, understand how to extract data, perhaps taking not all the rows – depending on the request (example: `copy (select id as id, email as anon_func_rnd(email) as email from tbl1 where ...) to stdout csv delimiter ',';` also see https://www.2ndquadrant.com/en/blog/tablesample-in-postgresql-9-5-2/). It is supposed that all PII columns are marked in such files -- and changes are controlled by responsible persons via GitOps so it's impossible to get PII unanomizied.
Questions to solve:
- what and how applies anonymization functions? Postgres on the source or Go code? Think: is it acceptable to transfer unanomized data and change it only before saving it to the target
- can we offer something that would simplify the initial setup for large existing databases?
Acceptance criteria
===
As a DLE's admin, I can set up a logical dump/restore process with anonymization (controlled via GitOps), fast (parallelized) enough to work with ~10 TiB database.
issue