README.md 9.02 KB
Newer Older
damien clochard's avatar
damien clochard committed
1
![PostgreSQL Anonymizer](images/png_RVB/PostgreSQL-Anonymizer_H_couleur.png)
damien clochard's avatar
damien clochard committed
2

damien clochard's avatar
damien clochard committed
3
Anonymization & Data Masking for PostgreSQL
damien clochard's avatar
init  
damien clochard committed
4
5
===============================================================================

damien clochard's avatar
damien clochard committed
6
7
`postgresql_anonymizer` is an extension to mask or replace
[personally identifiable information] (PII) or commercially sensitive data from
8
a PostgreSQL database.
damien clochard's avatar
damien clochard committed
9

Nikolay Samokhvalov's avatar
Nikolay Samokhvalov committed
10
The project is aiming toward a **declarative approach** of anonymization. This
damien clochard's avatar
damien clochard committed
11
means we're trying to extend PostgreSQL Data Definition Language (DDL) in
12
order to specify the anonymization strategy inside the table definition itself.
13

damien clochard's avatar
damien clochard committed
14
Once the maskings rules are defined, you can access the anonymized data in 3
damien clochard's avatar
damien clochard committed
15
16
17
different ways :

* [Anonymous Dumps] : Simply export the masked data into an SQL file
18
* [Static Masking] : Remove permanently the PII according to the rules
damien clochard's avatar
damien clochard committed
19
20
* [Dynamic Masking] : Hide PII only for the masked users

Nikolay Samokhvalov's avatar
Nikolay Samokhvalov committed
21
22
In addition, various [Masking Functions] are available: randomization, faking,
partial scrambling, shuffling, noise, or even your own custom function!
23

damien clochard's avatar
damien clochard committed
24
Read the [Concepts] section for more details and [NEWS.md] for information
25
about the latest version.
damien clochard's avatar
damien clochard committed
26

27
[NEWS.md]: NEWS.md
damien clochard's avatar
damien clochard committed
28
[INSTALL]: https://postgresql-anonymizer.readthedocs.io/en/latest/INSTALL/
damien clochard's avatar
damien clochard committed
29
[Concepts]: https://postgresql-anonymizer.readthedocs.io/en/latest/concepts/
damien clochard's avatar
damien clochard committed
30
[personally identifiable information]: https://en.wikipedia.org/wiki/Personally_identifiable_information
damien clochard's avatar
damien clochard committed
31

damien clochard's avatar
damien clochard committed
32
[Anonymous Dumps]: https://postgresql-anonymizer.readthedocs.io/en/latest/anonymous_dumps/
33
[Static Masking]: https://postgresql-anonymizer.readthedocs.io/en/latest/static_masking/
damien clochard's avatar
damien clochard committed
34
35
[Dynamic Masking]: https://postgresql-anonymizer.readthedocs.io/en/latest/dynamic_masking/
[Masking Functions]: https://postgresql-anonymizer.readthedocs.io/en/latest/masking_functions/
damien clochard's avatar
damien clochard committed
36
37

Declaring The Masking Rules
damien clochard's avatar
damien clochard committed
38
39
------------------------------------------------------------------------------

damien clochard's avatar
damien clochard committed
40
The main idea of this extension is to offer **anonymization by design**.
damien clochard's avatar
damien clochard committed
41

damien clochard's avatar
damien clochard committed
42
The data masking rules should be written by the people who develop the
damien clochard's avatar
damien clochard committed
43
application because they have the best knowledge of how the data model works.
damien clochard's avatar
damien clochard committed
44
Therefore masking rules must be implemented directly inside the database schema.
45

damien clochard's avatar
damien clochard committed
46
This allows masking the data directly inside the PostgreSQL instance without
damien clochard's avatar
damien clochard committed
47
using an external tool and thus limiting the exposure and the risks of data leak.
damien clochard's avatar
init  
damien clochard committed
48

49
The data masking rules are declared simply by using [security labels] :
damien clochard's avatar
damien clochard committed
50
51

```sql
52
=# CREATE EXTENSION IF NOT EXISTS anon CASCADE;
damien clochard's avatar
damien clochard committed
53

damien clochard's avatar
damien clochard committed
54
55
=# SELECT anon.load();

damien clochard's avatar
damien clochard committed
56
=# CREATE TABLE player( id SERIAL, name TEXT, points INT);
damien clochard's avatar
damien clochard committed
57

damien clochard's avatar
damien clochard committed
58
=# SECURITY LABEL FOR anon ON COLUMN player.name
59
-# IS 'MASKED WITH FUNCTION anon.fake_last_name()';
damien clochard's avatar
damien clochard committed
60
61
62

=# SECURITY LABEL FOR anon ON COLUMN player.id
-# IS 'MASKED WITH VALUE NULL';
damien clochard's avatar
damien clochard committed
63
64
```

65
[security labels]: https://www.postgresql.org/docs/current/sql-security-label.html
damien clochard's avatar
damien clochard committed
66

67
Static Masking
damien clochard's avatar
damien clochard committed
68
69
------------------------------------------------------------------------------

70
71
You can permanently remove the PII from a database with
`anon.anonymize_database()`. This will destroy the original data. Use with care.
damien clochard's avatar
damien clochard committed
72
73

```sql
damien clochard's avatar
damien clochard committed
74
=# SELECT * FROM customer;
damien clochard's avatar
damien clochard committed
75
76
77
78
79
 id  |   full_name      |   birth    |    employer   | zipcode | fk_shop
-----+------------------+------------+---------------+---------+---------
 911 | Chuck Norris     | 1940-03-10 | Texas Rangers | 75001   | 12
 112 | David Hasselhoff | 1952-07-17 | Baywatch      | 90001   | 423

damien clochard's avatar
damien clochard committed
80
81
82
83

=# CREATE EXTENSION IF NOT EXISTS anon CASCADE;
=# SELECT anon.load();

damien clochard's avatar
damien clochard committed
84
=# SECURITY LABEL FOR anon ON COLUMN customer.full_name
damien clochard's avatar
damien clochard committed
85
86
-# IS 'MASKED WITH FUNCTION anon.fake_first_name() || '' '' || anon.fake_last_name()';

damien clochard's avatar
damien clochard committed
87
=# SECURITY LABEL FOR anon ON COLUMN customer.birth
Olleg Samoylov's avatar
Olleg Samoylov committed
88
-# IS 'MASKED WITH FUNCTION anon.random_date_between(''1920-01-01''::DATE,now())';
damien clochard's avatar
damien clochard committed
89

90
=# SECURITY LABEL FOR anon ON COLUMN customer.employer
damien clochard's avatar
damien clochard committed
91
92
-# IS 'MASKED WITH FUNCTION anon.fake_company()';

93
=# SECURITY LABEL FOR anon ON COLUMN customer.zipcode
damien clochard's avatar
damien clochard committed
94
95
96
-# IS 'MASKED WITH FUNCTION anon.random_zip()';

=# SELECT anon.anonymize_database();
damien clochard's avatar
damien clochard committed
97
98
99
100
101

=# SELECT * FROM customer;
 id  |     full_name     |   birth    |     employer     | zipcode | fk_shop
-----+-------------------+------------+------------------+---------+---------
 911 | michel Duffus     | 1970-03-24 | Body Expressions | 63824   | 12
102
 112 | andromache Tulip  | 1921-03-24 | Dot Darcy        | 38199   | 423
damien clochard's avatar
damien clochard committed
103

damien clochard's avatar
damien clochard committed
104
105
```

106
107
108
You can also use `anonymize_table()` and `anonymize_column()` to remove data
from a subset of the database.

109

damien clochard's avatar
damien clochard committed
110

damien clochard's avatar
damien clochard committed
111
112
113
114
115
116


Dynamic Masking
------------------------------------------------------------------------------

You can hide the PII from a role by declaring it as a "MASKED". Other roles
damien clochard's avatar
damien clochard committed
117
will still access the original data.
damien clochard's avatar
damien clochard committed
118
119
120

**Example**:

Damien Clochard's avatar
Damien Clochard committed
121
```sql
damien clochard's avatar
damien clochard committed
122
=# SELECT * FROM people;
123
 id | firstname | lastname |   phone
damien clochard's avatar
damien clochard committed
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
----+----------+----------+------------
 T1 | Sarah    | Conor    | 0609110911
(1 row)
```

Step 1 : Activate the dynamic masking engine

```sql
=# CREATE EXTENSION IF NOT EXISTS anon CASCADE;
=# SELECT anon.start_dynamic_masking();
```

Step 2 : Declare a masked user

```sql
=# CREATE ROLE skynet LOGIN;
140
=# SECURITY LABEL FOR anon ON ROLE skynet IS 'MASKED';
damien clochard's avatar
damien clochard committed
141
142
143
144
145
```

Step 3 : Declare the masking rules

```sql
damien clochard's avatar
damien clochard committed
146
=# SECURITY LABEL FOR anon ON COLUMN people.lastname
147
-# IS 'MASKED WITH FUNCTION anon.fake_last_name()';
damien clochard's avatar
damien clochard committed
148

damien clochard's avatar
damien clochard committed
149
=# SECURITY LABEL FOR anon ON COLUMN people.phone
150
-# IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$******$$,2)';
damien clochard's avatar
damien clochard committed
151
152
153
154
155
156
```

Step 4 : Connect with the masked user

```sql
=# \! psql peopledb -U skynet -c 'SELECT * FROM people;'
157
 id | firstname | lastname  |   phone
damien clochard's avatar
damien clochard committed
158
159
160
161
162
----+----------+-----------+------------
 T1 | Sarah    | Stranahan | 06******11
(1 row)
```

damien clochard's avatar
damien clochard committed
163

damien clochard's avatar
damien clochard committed
164
165
Anonymous Dumps
------------------------------------------------------------------------------
damien clochard's avatar
damien clochard committed
166

damien clochard's avatar
damien clochard committed
167
168
Due to the core design of this extension, you cannot use `pg_dump` with a masked
user. If you want to export the entire database with the anonymized data, you
damien clochard's avatar
damien clochard committed
169
must use the `pg_dump_anon` command line. For example
damien clochard's avatar
damien clochard committed
170
171

```console
damien clochard's avatar
damien clochard committed
172
pg_dump_anon -h localhost -p 5432 -U bob bob_db > dump.sql
damien clochard's avatar
damien clochard committed
173
174
```

damien clochard's avatar
damien clochard committed
175
For more details, please read the [Anonymous Dumps] section.
damien clochard's avatar
damien clochard committed
176
177
178
179
180


Warning
------------------------------------------------------------------------------

Nikolay Samokhvalov's avatar
Nikolay Samokhvalov committed
181
*This is project is at an early stage of development and should be used carefully.*
damien clochard's avatar
damien clochard committed
182

Nikolay Samokhvalov's avatar
Nikolay Samokhvalov committed
183
We need your feedback and ideas! Let us know what you think of this tool, how it
damien clochard's avatar
damien clochard committed
184
185
186
187
fits your needs and what features are missing.

You can either [open an issue] or send a message at <contact@dalibo.com>.

damien clochard's avatar
damien clochard committed
188
[open an issue]: https://gitlab.com/dalibo/postgresql_anonymizer/issues
damien clochard's avatar
damien clochard committed
189

damien clochard's avatar
damien clochard committed
190

damien clochard's avatar
damien clochard committed
191
Requirements
192
--------------------------------------------------------------------------------
damien clochard's avatar
damien clochard committed
193

damien clochard's avatar
damien clochard committed
194
This extension works with all [supported versions of PostgreSQL].
damien clochard's avatar
damien clochard committed
195

damien clochard's avatar
damien clochard committed
196
[supported versions of PostgreSQL]: https://www.postgresql.org/support/versioning/
damien clochard's avatar
damien clochard committed
197

damien clochard's avatar
damien clochard committed
198
It requires 2 extensions called [tsm_system_rows] and [pgcrypto] which are
damien clochard's avatar
damien clochard committed
199
delivered by the `postgresql-contrib` package of the main linux distributions.
damien clochard's avatar
damien clochard committed
200
201

[tsm_system_rows]: https://www.postgresql.org/docs/current/tsm-system-rows.html
damien clochard's avatar
damien clochard committed
202
[pgcrypto]: https://www.postgresql.org/docs/current/pgcrypto.html
damien clochard's avatar
damien clochard committed
203

204

damien clochard's avatar
damien clochard committed
205
206
207
Install
-------------------------------------------------------------------------------

208
_Step 1._ Install the extension on the server with :
damien clochard's avatar
damien clochard committed
209
210
211
212

```console
sudo pgxn install postgresql_anonymizer
```
damien clochard's avatar
damien clochard committed
213

damien clochard's avatar
damien clochard committed
214
_Step 2:_  Load the extension in the database you want to anonymize
215

216
```sql
damien clochard's avatar
damien clochard committed
217
ALTER DATABASE foo SET session_preload_libraries = 'anon';
218
```
damien clochard's avatar
damien clochard committed
219

damien clochard's avatar
damien clochard committed
220
221
222
There are other ways to install and load the extension. You can read the [INSTALL]
section for detailed instructions or if you want to deploy it on Amazon RDS or
some other DBaaS provider.
damien clochard's avatar
damien clochard committed
223

damien clochard's avatar
damien clochard committed
224
225


226

227
228
229
Limitations
------------------------------------------------------------------------------

damien clochard's avatar
damien clochard committed
230
231
232
* The dynamic masking system only works with one schema (by default `public`).
  When you start the masking engine with `start_dynamic_masking()`, you can
  specify the schema that will be masked with `SELECT start_dynamic_masking('sales');`.
233
  **However** static masking with `anon.anonymize()`and anonymous
234
  export with `anon.dump()` will work fine with multiple schemas.
235

236
237


damien clochard's avatar
damien clochard committed
238
239
240
Performance
------------------------------------------------------------------------------

damien clochard's avatar
damien clochard committed
241
242
243
So far, we've done very few performance tests. Depending on the size of your
data set and number of columns your need to anonymize, you might end up with a
very slow process.
damien clochard's avatar
damien clochard committed
244

damien clochard's avatar
damien clochard committed
245
Here's some ideas:
damien clochard's avatar
damien clochard committed
246
247
248

### Sampling

Nikolay Samokhvalov's avatar
Nikolay Samokhvalov committed
249
If you need to anonymize data for testing purpose, chances are that a smaller
damien clochard's avatar
damien clochard committed
250
subset of your database will be enough. In that case, you can easily speed up
251
the anonymization by downsizing the volume of data. There are multiple ways to
Nikolay Samokhvalov's avatar
Nikolay Samokhvalov committed
252
extract a sample of database:
damien clochard's avatar
damien clochard committed
253
254
255
256
257
258

* [TABLESAMPLE](https://www.postgresql.org/docs/current/static/sql-select.html)
* [pg_sample](https://github.com/mla/pg_sample)



damien clochard's avatar
damien clochard committed
259
### Materialized Views
damien clochard's avatar
damien clochard committed
260

Nikolay Samokhvalov's avatar
Nikolay Samokhvalov committed
261
Dynamic masking is not always required! In some cases, it is more efficient
damien clochard's avatar
damien clochard committed
262
to build [Materialized Views] instead.
damien clochard's avatar
damien clochard committed
263

damien clochard's avatar
damien clochard committed
264
For instance:
damien clochard's avatar
damien clochard committed
265

damien clochard's avatar
damien clochard committed
266
```sql
damien clochard's avatar
damien clochard committed
267
CREATE MATERIALIZED VIEW masked_customer AS
damien clochard's avatar
damien clochard committed
268
SELECT
damien clochard's avatar
damien clochard committed
269
270
    id,
    anon.random_last_name() AS name,
Olleg Samoylov's avatar
Olleg Samoylov committed
271
    anon.random_date_between('1920-01-01'::DATE,now()) AS birth,
damien clochard's avatar
damien clochard committed
272
    fk_last_order,
damien clochard's avatar
damien clochard committed
273
    store_id
damien clochard's avatar
damien clochard committed
274
275
276
FROM customer;
```

damien clochard's avatar
damien clochard committed
277
[Materialized Views]: https://www.postgresql.org/docs/current/static/sql-creatematerializedview.html
278

damien clochard's avatar
damien clochard committed
279