Parallel Static Masking
Currently the anon.anonymize_database() function runs over all tables sequentially.
We could introduce a bit of parallelism in the process and run static masking over multiple tables at once.
Basic Use Case
A database contains 10 tables of different sizes:
| table | size of each table |
|---|---|
| t_a | 500MB |
| t_b, t_c | 200MB |
| t_d, t_e, t_f | 90MB |
| t_g, t_h, t_i, t_j | 2MB |
With the current implementation, anon.anonymize_database() will statically mask the tables one by one. The process will be run in a single session using only one CPU.
A better approach would be to split the workload into batches by adding an optional parameter named jobs to anon.anonymize_database(). Calling anon.anonymize_database(jobs=4) would create 4 groups of tables, open a separate session for each group, and run static masking in all sessions concurrently.
The groups could be composed so that the total volume is spread across all groups, for example:
| Group | tables | size of the group |
|---|---|---|
| group 1 | t_a | 500MB |
| group 2 | t_d, t_e, t_f | 270MB |
| group 3 | t_b, t_g, t_h, t_i, t_j | 208MB |
| group 4 | t_c | 200MB |
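One possible way to build such groups is a greedy heuristic: take the tables from largest to smallest and always add the next table to the group with the smallest total so far. The sketch below is only an illustration of that idea, not the extension's implementation; the function name `group_tables` and the `sizes_mb` mapping are assumptions made for this example.

```python
import heapq


def group_tables(sizes_mb, jobs):
    """Greedy split: largest table first, always into the lightest group.

    Hypothetical helper for illustration; not part of the extension.
    Returns (groups, totals), both indexed by group number.
    """
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    # Heap entries are (group total, group index); the index breaks ties.
    heap = [(0, i) for i in range(jobs)]
    groups = [[] for _ in range(jobs)]
    # sorted() is stable, so tables of equal size keep their input order.
    for name, size in sorted(sizes_mb.items(), key=lambda kv: -kv[1]):
        total, i = heapq.heappop(heap)
        groups[i].append(name)
        heapq.heappush(heap, (total + size, i))
    totals = [0] * jobs
    for total, i in heap:
        totals[i] = total
    return groups, totals


# Table sizes (in MB) from the use case above.
sizes = {
    "t_a": 500, "t_b": 200, "t_c": 200,
    "t_d": 90, "t_e": 90, "t_f": 90,
    "t_g": 2, "t_h": 2, "t_i": 2, "t_j": 2,
}
groups, totals = group_tables(sizes, jobs=4)
for tables, total in zip(groups, totals):
    print(", ".join(tables), f"{total}MB")
```

With these sizes the heuristic spreads the four small tables over two groups instead of one, so the exact composition differs from the table above, but the largest group is still t_a alone at 500MB.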
The execution time of anon.anonymize_database(jobs=4) would be the execution time of the slowest job (here probably the first one, which holds the 500MB table t_a).
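The "one session per group" idea can be sketched with a worker pool in which each worker walks through its own group sequentially while the groups run side by side. This is a rough illustration under stated assumptions: `run_jobs` is a hypothetical name, and `mask_table` is a placeholder callable that, in a real implementation, would presumably open its own database session and apply static masking to a single table.

```python
from concurrent.futures import ThreadPoolExecutor


def run_jobs(groups, mask_table):
    """Run one worker per group; each worker masks its tables in order.

    `mask_table` is a placeholder for whatever masks a single table.
    Returns the number of tables processed by each group.
    """
    def run_group(tables):
        # A real worker would own one database session for the whole group.
        for table in tables:
            mask_table(table)
        return len(tables)

    if not groups:
        return []
    # One thread per group, so all groups progress concurrently.
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        return list(pool.map(run_group, groups))


# Stand-in for the real masking call: only records the table name.
masked = []
counts = run_jobs([["t_a"], ["t_d", "t_e", "t_f"]], masked.append)
print(counts, sorted(masked))
```

Because the pool only returns once every group has finished, the total elapsed time is driven by the slowest group, which matches the estimate above.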