Parallel Static Masking

Currently the anon.anonymize_database() function runs over all tables sequentially.

We could introduce a bit of parallelism in the process and run static masking over multiple tables at once.

Basic Use Case

A database contains 10 tables of different sizes:

| Tables             | Size of each table |
|--------------------|--------------------|
| t_a                | 500 MB             |
| t_b, t_c           | 200 MB             |
| t_d, t_e, t_f      | 90 MB              |
| t_g, t_h, t_i, t_j | 2 MB               |

With the current implementation, anon.anonymize_database() will statically mask the tables one by one. The process will be run in a single session using only one CPU.

A better approach would be to split the workload into batches by adding an optional parameter named jobs to anon.anonymize_database(). Calling anon.anonymize_database(jobs=4) would create 4 groups of tables, open a separate session for each group, and run static masking on each group independently.
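To illustrate the "one session per group" idea, here is a minimal client-side sketch in Python. It is not part of the extension: `mask_in_parallel` and the `mask_group` callback are hypothetical names, and the callback is assumed to open its own database session and mask each table of its group in turn (for example by calling anon.anonymize_table() on each one).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def mask_in_parallel(
    groups: Sequence[Sequence[str]],
    mask_group: Callable[[Sequence[str]], None],
) -> None:
    """Run mask_group once per group, each call in its own worker thread.

    mask_group is a hypothetical callback: it is assumed to open one
    database session, mask every table of the group it receives, and
    close the session when done.
    """
    if not groups:
        return
    # One worker per group, mirroring "one session per group".
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        futures = [pool.submit(mask_group, group) for group in groups]
        for future in futures:
            # result() blocks until the job ends and re-raises its error, if any.
            future.result()
```

The call returns only once every group has finished, so the caller waits for the slowest job. Threads are enough here because the heavy work would happen in the database sessions, not in the Python process.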

The groups could be composed so that the total volume is spread across all groups:

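| Group   | Tables                  | Size of the group |
|---------|-------------------------|-------------------|
| group 1 | t_a                     | 500 MB            |
| group 2 | t_d, t_e, t_f           | 270 MB            |
| group 3 | t_b, t_g, t_h, t_i, t_j | 208 MB            |
| group 4 | t_c                     | 200 MB            |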
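One possible way to build such groups is a greedy heuristic: sort the tables from largest to smallest and always add the next table to the group with the smallest running total. The sketch below is only an illustration of that idea, not the extension's implementation. The function name `split_into_groups` and the `SIZES_MB` dictionary are assumptions made for the example, with the sizes taken from the use case above.

```python
import heapq

# Table sizes in MB, taken from the use case above.
SIZES_MB = {
    "t_a": 500,
    "t_b": 200, "t_c": 200,
    "t_d": 90, "t_e": 90, "t_f": 90,
    "t_g": 2, "t_h": 2, "t_i": 2, "t_j": 2,
}


def split_into_groups(table_sizes, jobs):
    """Greedily spread tables over `jobs` groups, largest tables first."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    groups = [[] for _ in range(jobs)]
    # Min-heap of (current total size, group index): the lightest group is on top.
    heap = [(0, index) for index in range(jobs)]
    by_size = sorted(table_sizes.items(), key=lambda item: item[1], reverse=True)
    for name, size in by_size:
        total, index = heapq.heappop(heap)
        groups[index].append(name)
        heapq.heappush(heap, (total + size, index))
    return groups


groups = split_into_groups(SIZES_MB, jobs=4)
totals = [sum(SIZES_MB[name] for name in group) for group in groups]
```

With these sizes the heuristic produces groups of 500 MB, 204 MB, 204 MB and 270 MB. The assignment differs slightly from the table above, but the largest group is still t_a alone at 500 MB, so the expected duration of the slowest job is the same.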

The execution time of anon.anonymize_database(jobs=4) would be the execution time of the slowest job (here probably the first one).

Edited by damien clochard