Commit 2d277366 authored by damien clochard 🐘

[doc] update the README

parent 880a0501
Pipeline #508791006 passed with stages
in 7 minutes and 52 seconds
@@ -11,12 +11,13 @@ The project relies on a **declarative approach** of anonymization. This means
we're using the PostgreSQL Data Definition Language (DDL) in order to specify
the anonymization strategy inside the table definition itself.
Once the masking rules are defined, you can access the anonymized data in 3
Once the masking rules are defined, you can access the anonymized data in
different ways :
* [Anonymous Dumps] : Simply export the masked data into an SQL file
* [Static Masking] : Permanently remove the PII according to the rules
* [Dynamic Masking] : Hide PII only for the masked users
* [Generalization] : Reducing the accuracy of dates and numbers
In addition, various [Masking Functions] are available: randomization, faking,
partial scrambling, shuffling, noise, or even your own custom function!
@@ -33,6 +34,7 @@ about the latest version.
[Static Masking]: https://postgresql-anonymizer.readthedocs.io/en/latest/static_masking/
[Dynamic Masking]: https://postgresql-anonymizer.readthedocs.io/en/latest/dynamic_masking/
[Masking Functions]: https://postgresql-anonymizer.readthedocs.io/en/latest/masking_functions/
[Generalization]: https://postgresql-anonymizer.readthedocs.io/en/latest/generalization/
Declaring The Masking Rules
------------------------------------------------------------------------------
@@ -51,7 +53,7 @@ The data masking rules are declared simply by using [security labels] :
```sql
=# CREATE EXTENSION IF NOT EXISTS anon CASCADE;
=# SELECT anon.load();
=# SELECT anon.init();
=# CREATE TABLE player( id SERIAL, name TEXT, points INT);
@@ -77,10 +79,6 @@ You can permanently remove the PII from a database with
911 | Chuck Norris | 1940-03-10 | Texas Rangers | 75001 | 12
112 | David Hasselhoff | 1952-07-17 | Baywatch | 90001 | 423
=# CREATE EXTENSION IF NOT EXISTS anon CASCADE;
=# SELECT anon.load();
=# SECURITY LABEL FOR anon ON COLUMN customer.full_name
-# IS 'MASKED WITH FUNCTION anon.fake_first_name() || '' '' || anon.fake_last_name()';
@@ -99,7 +97,7 @@ You can permanently remove the PII from a database with
id | full_name | birth | employer | zipcode | fk_shop
-----+-------------------+------------+------------------+---------+---------
911 | michel Duffus | 1970-03-24 | Body Expressions | 63824 | 12
112 | andromach Tulip | 1921-03-24 | Dot Darcy | 38199 | 423
112 | andromach Tulip | 1921-03-24 | Dot Darcy | 38199 | 423
```
@@ -169,10 +167,10 @@ user. If you want to export the entire database with the anonymized data, you
must use the `pg_dump_anon` command line. For example:
```console
pg_dump_anon -h localhost -p 5432 -U bob bob_db > dump.sql
pg_dump_anon.sh -h localhost -p 5432 -U bob bob_db > dump.sql
```
For more details, please read the [Anonymous Dumps] section.
For more details, read the [Anonymous Dumps] section.
Support
@@ -193,33 +191,16 @@ This extension works with all [supported versions of PostgreSQL].
[supported versions of PostgreSQL]: https://www.postgresql.org/support/versioning/
It requires 2 extensions called [tsm_system_rows] and [pgcrypto] which are
delivered by the `postgresql-contrib` package of the main Linux distributions.
It requires an extension called [pgcrypto] which is delivered by the
`postgresql-contrib` package of the main Linux distributions.
[tsm_system_rows]: https://www.postgresql.org/docs/current/tsm-system-rows.html
[pgcrypto]: https://www.postgresql.org/docs/current/pgcrypto.html
Install
-------------------------------------------------------------------------------
_Step 1._ Install the extension on the server with :
```console
sudo pgxn install postgresql_anonymizer
```
_Step 2:_ Load the extension in the database you want to anonymize
```sql
ALTER DATABASE foo SET session_preload_libraries = 'anon';
```
There are other ways to install and load the extension. You can read the [INSTALL]
section for detailed instructions or if you want to deploy it on Amazon RDS or
some other DBaaS provider.
See the [INSTALL] section
Limitations
@@ -227,7 +208,7 @@ Limitations
* The dynamic masking system only works with one schema (by default `public`).
When you start the masking engine with `start_dynamic_masking()`, you can
specify the schema that will be masked with `SELECT start_dynamic_masking('sales');`.
specify the schema that will be masked with.
**However** static masking with `anon.anonymize()` and [Anonymous Dumps] will
work fine with multiple schemas.
@@ -239,42 +220,10 @@ Limitations
Performance
------------------------------------------------------------------------------
So far, we've done very few performance tests. Depending on the size of your
data set and the number of columns you need to anonymize, you might end up with a
very slow process.
Here are some ideas:
### Sampling
If you need to anonymize data for testing purposes, chances are that a smaller
subset of your database will be enough. In that case, you can easily speed up
the anonymization by downsizing the volume of data. There are multiple ways to
extract a sample of a database:
* [TABLESAMPLE](https://www.postgresql.org/docs/current/static/sql-select.html)
* [pg_sample](https://github.com/mla/pg_sample)
See [docs/performances.md]
[docs/performances.md]: https://postgresql-anonymizer.readthedocs.io/en/latest/performances/
### Materialized Views
Dynamic masking is not always required! In some cases, it is more efficient
to build [Materialized Views] instead.
For instance:
```sql
CREATE MATERIALIZED VIEW masked_customer AS
SELECT
id,
anon.random_last_name() AS name,
anon.random_date_between('1920-01-01'::DATE,now()) AS birth,
fk_last_order,
store_id
FROM customer;
```
[Materialized Views]: https://www.postgresql.org/docs/current/static/sql-creatematerializedview.html
@@ -19,7 +19,6 @@ different ways :
* [Static Masking] : Remove the PII according to the rules
* [Dynamic Masking] : Hide PII only for the masked users
In addition, various [Masking Functions] are available : randomization, faking,
partial scrambling, shuffling, noise or even your own custom function!
@@ -66,3 +66,44 @@ take 2 hours.
> In this case, the cost of anonymization is "paid" by the user asking for the
> anonymous export. Other users of the database will not be affected.
How to speed things up ?
------------------------------------------------------------------------------
### Prefer `MASKED WITH VALUE` whenever possible
It is always faster to replace the original data with a static value instead
of calling a masking function.
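As a rough illustration of the difference, the sketch below declares one rule of each kind. The `customer` table follows the examples used elsewhere in this documentation, but the `employer` and `name` columns and the `'CONFIDENTIAL'` placeholder are assumptions chosen for the example, not something this page prescribes.

```sql
-- Static value: every row receives the same constant,
-- so no function has to be evaluated per row.
-- 'CONFIDENTIAL' is an arbitrary placeholder (assumption).
SECURITY LABEL FOR anon ON COLUMN customer.employer
  IS 'MASKED WITH VALUE ''CONFIDENTIAL''';

-- Masking function: anon.fake_last_name() is called for each row,
-- which gives more realistic data at a higher cost on large tables.
SECURITY LABEL FOR anon ON COLUMN customer.name
  IS 'MASKED WITH FUNCTION anon.fake_last_name()';
```

A static value is a reasonable default for columns whose content does not matter to the people reading the masked data.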
### Sampling
If you need to anonymize data for testing purposes, chances are that a smaller
subset of your database will be enough. In that case, you can easily speed up
the anonymization by downsizing the volume of data. There are multiple ways to
extract a sample of a database:
* [TABLESAMPLE](https://www.postgresql.org/docs/current/static/sql-select.html)
* [pg_sample](https://github.com/mla/pg_sample)
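For the first option, a minimal sketch could look like the following. The `customer_sample` table name and the 10% ratio are assumptions made for the example.

```sql
-- Copy roughly 10% of the rows into a new table.
-- BERNOULLI picks rows one by one; SYSTEM picks whole pages and is
-- usually faster but less evenly distributed.
-- customer_sample is a hypothetical table name.
CREATE TABLE customer_sample AS
SELECT *
FROM customer TABLESAMPLE BERNOULLI (10);
```

Note that `TABLESAMPLE` works on one table at a time and does not follow foreign keys, so rows in related tables may end up pointing at customers that were not sampled.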
### Materialized Views
Dynamic masking is not always required! In some cases, it is more efficient
to build [Materialized Views] instead.
For instance:
```sql
CREATE MATERIALIZED VIEW masked_customer AS
SELECT
id,
anon.random_last_name() AS name,
anon.random_date_between('1920-01-01'::DATE,now()) AS birth,
fk_last_order,
store_id
FROM customer;
```
[Materialized Views]: https://www.postgresql.org/docs/current/static/sql-creatematerializedview.html
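The masked values in such a view are computed when the view is built and stay the same until it is refreshed. The sketch below shows one possible way to use the view; the `analyst` role is hypothetical and must already exist.

```sql
-- Rebuild the view from the current content of customer.
-- The random functions are called again, so the fake values change.
REFRESH MATERIALIZED VIEW masked_customer;

-- Hypothetical role: it may read the masked view
-- but not the original table.
REVOKE ALL ON customer FROM analyst;
GRANT SELECT ON masked_customer TO analyst;
```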