Commit d77cba07 authored by Anatoly Stansler

Merge branch 'use-cases' into 'master'

feat: masking and data sources guides

See merge request postgres-ai/docs!110
parents 3ce21d62 210f5724
......@@ -3,7 +3,7 @@ title: Overview of postgres-checkup
sidebar_label: Checkup
---
(Ops) **`postgres-checkup`** – a powerful tool automating health checks of PostgreSQL databases. Its key features are unobtrusiveness, "zero install", and complex and deep analysis of a whole PostgreSQL set of nodes (primary plus its followers). With `postgres-checkup`, an experienced DBA spends only 4 hours instead of 2 weeks to analyze a heavily-loaded PostgreSQL setup when seeing it for the first time.
**`postgres-checkup`** – a powerful tool that automates health checks of PostgreSQL databases. Its key features are unobtrusiveness, "zero install", and complex, deep analysis of a whole set of PostgreSQL nodes (the primary and its followers). With `postgres-checkup`, an experienced DBA spends only 4 hours instead of 2 weeks to analyze a heavily loaded PostgreSQL setup when seeing it for the first time.
## General
......
......@@ -10,7 +10,7 @@ hide_title: false
## Guides
- [Start using Database Lab](/docs/tutorials/engine-setup)
- [Start using Database Lab with AWS RDS](/docs/tutorials/engine-setup-rds)
- [Start using Database Lab with AWS RDS](/docs/tutorials/database-lab-tutorial-amazon-rds)
## References
......@@ -22,7 +22,7 @@ hide_title: false
## Overview
**Database Lab** – the core component based on which powerful, state-of-the-art development and testing environments are built. It is based on a simple idea: with modern thin cloning technologies, it becomes possible to iterate 100x faster in development and testing. It is extremely helpful for larger companies that want to achieve high development velocity and the most competitive "time to market" characteristics.
**Database Lab Engine** – the core component on which powerful, state-of-the-art development and testing environments are built. It is based on a simple idea: with modern thin cloning technologies, it becomes possible to iterate 100x faster in development and testing. It is extremely helpful for larger companies that want to achieve high development velocity and the most competitive "time to market" characteristics.
Database Lab aims to speed up software development in fast-growing organizations that use large PostgreSQL databases. This is achieved by enabling extremely fast and low-budget cloning of large databases.
......
---
title: Database Lab data masking
sidebar_label: Data masking
---
## Premasking
Allows you to easily set up dynamic masking rules without changing the actual data. Data is masked only during the logical dump, on the side of the source database.
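For illustration, one common way to define such rules is with the [PostgreSQL Anonymizer](https://postgresql-anonymizer.readthedocs.io/) extension (also referenced in the Obfuscation section below); a minimal sketch, assuming the extension is installed on the source database and a hypothetical `users` table exists:

```bash
# Hypothetical masking rule declared with PostgreSQL Anonymizer;
# database, table, and column names are placeholders.
psql --dbname=mydb <<'SQL'
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SECURITY LABEL FOR anon ON COLUMN users.last_name
  IS 'MASKED WITH FUNCTION anon.fake_last_name()';
SQL
```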
### Option 1a. Anonymized dump
Database Lab retrieves the data from production in the form of an anonymized logical dump. Requires additional masking setup on the production side.
![Premasking / Option 1a. Anonymized dump](/docs/assets/masking-1a-dump.png)
#### Pros
- PII only on production
- Identical data structure for development and optimization
#### Cons
- Anonymization affects production performance
### Option 1b. Anonymized dump using additional Database Lab Engine
A Database Lab Engine in the production infrastructure is used to create an anonymized dump. The Database Lab Engine in the test/dev/staging environment retrieves the data from production in the form of an anonymized dump.
![Premasking / Option 1b. Anonymized dump using additional Database Lab Engine](/docs/assets/masking-1b-dump-add.png)
#### Pros
- PII only on production
- Identical data structure for development and optimization
- Data anonymization without affecting the production database
- Data recovery and heavy analytics queries without affecting the production database
#### Cons
- Deployment of an additional Database Lab Engine
## Postmasking
Allows you to set up dynamic masking rules without changing the actual data. Data is dynamically masked on the side of the Database Lab Engine.
### Option 2a. Database Lab Engine on production
Database Lab Engine is deployed only in the production infrastructure and works with PII. Depending on the access level, developers may or may not have access to the unmasked data.
![Postmasking / Option 2a. Database Lab Engine on production](/docs/assets/masking-2a-production.png)
#### Pros
- PII only on production
- Identical data structure for development and optimization
- Data anonymization without affecting the production database
- Data recovery and heavy analytics queries without affecting the production database
- Easy to deploy
#### Cons
- High requirements for security administration
- Harder to configure access for developers to use Database Lab
### Option 2b. Database Lab Engine in Test/Dev/Staging
Database Lab Engine is deployed only in the test/dev/staging infrastructure and works with PII. Developers work with masked data.
![Postmasking / Option 2b. Database Lab Engine in Test/Dev/Staging](/docs/assets/masking-2b-staging.png)
#### Pros
- Identical data structure for development and optimization
- Data anonymization without affecting the production database
- Data recovery and heavy analytics queries without affecting the production database
- Easy to deploy
#### Cons
- PII physically copied from the production infrastructure (but cannot be accessed by developers)
- High requirements for security administration of test/dev/staging environments
## Obfuscation
Instead of masking, the data can be deleted permanently, e.g. during snapshot creation.
Options:
- Use a custom obfuscation script (define it using the `preprocessingScript` option of the [`logicalSnapshot`](/docs/database-lab/config-reference#job-logicalsnapshot) or [`physicalSnapshot`](/docs/database-lab/config-reference#job-physicalsnapshot) jobs); see the sketch below;
- Use the [In-place anonymization](https://postgresql-anonymizer.readthedocs.io/en/latest/in_place_anonymization/) feature of PostgreSQL Anonymizer.
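For illustration, a `preprocessingScript` could permanently rewrite PII with plain SQL before the snapshot is taken; a minimal sketch (table, columns, and connection details are all hypothetical):

```bash
#!/bin/bash
# Hypothetical obfuscation script referenced by the preprocessingScript option.
# It permanently rewrites PII in the Database Lab copy before the snapshot is taken.
set -euo pipefail

# Assumption: Postgres is reachable locally at this point of snapshot preparation.
psql --host=localhost --port=5432 --username=postgres --dbname=mydb <<'SQL'
UPDATE users SET email = 'user_' || id || '@example.com';  -- hypothetical table/columns
UPDATE users SET phone = NULL;
SQL
```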
---
id: get-started
title: Getting Started
title: Getting started with Database Lab
hide_title: false
sidebar_label: Getting Started
sidebar_label: Getting started
---
## Use cases & products
- [Database Lab engine / Thin clones provisioning framework](/docs/database-lab)
- [Staging with superpowers / Deploy disposable databases for development and review in seconds](/docs/staging)
- [Joe bot / SQL optimization made simple](/docs/joe-bot)
- [Checkup / Detect database bottlenecks before they appear](/docs/checkup)
- [Data recovery / Instantaneously recover lost data](/docs/data-recovery)
- [Database changes CI/CD / Minimize downtime and code refactoring caused by database migrations](/docs/database-changes-cicd)
- [Data access / Analyze your data without affecting production](/docs/data-access)
| | |
| ----------- | ----------- |
| [Database Lab Engine](/docs/database-lab)<br>Open-source technology to clone databases of any size in seconds | [Joe, SQL optimization chatbot](/docs/joe-bot)<br>Run `EXPLAIN ANALYZE` and optimize SQL on instantly provisioned full-size database copies |
| [Dev/QA/Staging databases with superpowers](/docs/staging)<br>Develop and test using full-size database copies provisioned in seconds | [CI/CD observer for DB schema changes](/docs/database-changes-cicd)<br>Prevent performance degradation and downtime when deploying database schema changes |
| [postgres-checkup](/docs/checkup)<br>Automated health checks and query analysis for heavily loaded PostgreSQL databases | [Detached replicas](/docs/data-access)<br>Use BI tools, run analytical queries, perform data export without replication lags and bloat |
<!--#### [Data recovery / Instantaneously recover lost data](/docs/data-recovery)
Recover accidentally deleted data. Using thin cloning, the point-in-time recovery (PITR) can be performed without long waiting.
-->
## What is Database Lab?
......
---
title: Administration guides
title: Database Lab administration
sidebar_label: Administration
---
- [How to manage Database Lab Engine](/docs/guides/administration/engine-manage)
- [How to manage Joe Bot](/docs/guides/administration/manage-joe)
- [Secure Database Lab Engine](/docs/guides/administration/engine-secure)
- [Set up a machine for the Database Lab Engine](/docs/guides/administration/machine-setup)
[↵ Back to Guides](/docs/guides/)
---
title: Set up a machine for the Database Lab Engine
sidebar_label: Set up a machine for the Database Lab Engine
---
[↵ Back to Administration guides](/docs/guides/administration)
## Prepare a machine
Create an EC2 instance with Ubuntu 18.04 and an additional EBS volume to store data. You can find detailed instructions on how to create an AWS EC2 instance [here](https://docs.aws.amazon.com/efs/latest/ug/gs-step-one-create-ec2-resources.html) (if you want to use Google Cloud, see [the GCP documentation](https://cloud.google.com/compute/docs/instances/create-start-instance)).
## (optional) Ports that need to be open in the Security Group being used
You will need to allow connections to the following ports (inbound rules in your Security Group; see the CLI sketch below):
- `22`: to connect to the instance using SSH;
- `2345`: to work with the Database Lab Engine API (can be changed in the Database Lab Engine configuration file);
- `6000-6100`: to connect to PostgreSQL clones (this is the default port range used in the Database Lab Engine configuration file; it can be changed if needed).
> For real-life use, it is not a good idea to open ports to the public. Instead, it is recommended to use VPN or SSH port forwarding to access both the Database Lab API and PostgreSQL clones, or to enforce encryption for all connections using NGINX with SSL and configuring SSL in PostgreSQL.
Additionally, to be able to install software, allow access to external resources over HTTP/HTTPS (outbound rules in your Security Group):
- `80` for HTTP;
- `443` for HTTPS.
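If you prefer the command line to the AWS console, the same inbound rules can be created with the AWS CLI; a sketch, assuming a hypothetical Security Group ID and restricting access to your own IP:

```bash
# Placeholders: replace the Security Group ID and CIDR with your own values.
SG_ID="sg-0123456789abcdef0"
MY_IP="203.0.113.10/32"

# Inbound rules: SSH, Database Lab Engine API, PostgreSQL clone port range.
aws ec2 authorize-security-group-ingress --group-id "${SG_ID}" --protocol tcp --port 22 --cidr "${MY_IP}"
aws ec2 authorize-security-group-ingress --group-id "${SG_ID}" --protocol tcp --port 2345 --cidr "${MY_IP}"
aws ec2 authorize-security-group-ingress --group-id "${SG_ID}" --protocol tcp --port 6000-6100 --cidr "${MY_IP}"
```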
Here is how the inbound and outbound rules in your Security Group may look:
![Security Group inbound rules](/docs/assets/ec2_security_group_inbound.png)
![Security Group outbound rules](/docs/assets/ec2_security_group_outbound.png)
## Install Docker
If needed, you can find a detailed Docker installation guide [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/).
Install dependencies:
```bash
sudo apt-get update && sudo apt-get install -y \
apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common
```
Install Docker:
```bash
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
sudo apt-get update && sudo apt-get install -y \
docker-ce \
docker-ce-cli \
containerd.io
```
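To confirm that Docker is installed and running, you can launch its test image (the exact output will vary):

```bash
# Pulls and runs Docker's official test image, then removes the container.
sudo docker run --rm hello-world
```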
## Set $DBLAB_DISK
Further, we will need the `$DBLAB_DISK` environment variable. It must contain the device name corresponding to the disk where all the Database Lab Engine data will be stored.
To understand what needs to be specified in `$DBLAB_DISK` in your case, check the output of `lsblk`:
```bash
sudo lsblk
```
Some examples:
- **AWS local ephemeral NVMe disks; EBS volumes for instances built on [the Nitro system](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html)**:
```bash
$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
nvme0n1 259:0 0 777G 0 disk
$ export DBLAB_DISK="/dev/nvme0n1"
```
- **AWS EBS volumes for older (pre-Nitro) EC2 instances**:
```bash
$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
xvdb 202:16 0 777G 0 disk
$ export DBLAB_DISK="/dev/xvdb"
```
## Set up either ZFS or LVM to enable thin cloning
ZFS is the recommended way to enable thin cloning in Database Lab. LVM is also available but has certain limitations:
- much less flexible disk space consumption, and a risk that a clone gets destroyed during massive operations in it;
- inability to work with multiple snapshots ("time travel"): cloning always happens based on the most recent version of the data.
<!--DOCUSAURUS_CODE_TABS-->
<!--ZFS-->
## Option 1: Use ZFS
Install ZFS:
```bash
sudo apt-get install -y zfsutils-linux
```
Create a new ZFS storage pool (make sure `$DBLAB_DISK` has the correct value, see the previous step!):
```bash
sudo zpool create -f \
-O compression=on \
-O atime=off \
-O recordsize=8k \
-O logbias=throughput \
-m /var/lib/dblab/data \
dblab_pool \
"${DBLAB_DISK}"
```
Check the result using `zfs list` and `lsblk`; it should look like this:
```bash
$ sudo zfs list
NAME USED AVAIL REFER MOUNTPOINT
dblab_pool 106K 777G 24K /var/lib/dblab/data
$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
nvme0n1 259:0 0 777G 0 disk
├─nvme0n1p1 259:3 0 777G 0 part
└─nvme0n1p9 259:4 0 8M 0 part
```
<!--LVM-->
## Option 2: Use LVM
Install LVM2:
```bash
sudo apt-get install -y lvm2
```
Create an LVM volume (make sure `$DBLAB_DISK` has the correct value, see the previous step!):
```bash
# Create Physical Volume and Volume Group
sudo pvcreate "${DBLAB_DISK}"
sudo vgcreate dblab_vg "${DBLAB_DISK}"
# Create Logical Volume for PGDATA
sudo lvcreate -l 10%FREE -n pg_lv dblab_vg
sudo mkfs.ext4 /dev/dblab_vg/pg_lv
sudo mkdir -p /var/lib/dblab/{data,clones,sockets}
sudo mount /dev/dblab_vg/pg_lv /var/lib/dblab/data
# Create PGDATA directory
sudo mkdir -p /var/lib/dblab/data/pgdata
# Bootstrap LVM snapshots so they can be used inside Docker containers
sudo lvcreate --snapshot --extents 10%FREE --yes --name dblab_bootstrap dblab_vg/pg_lv
sudo lvremove --yes dblab_vg/dblab_bootstrap
```
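As with the ZFS option, it is worth checking the result; `lvs` and `df` should show something like this (sizes below are illustrative):

```bash
$ sudo lvs
  LV    VG       Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  pg_lv dblab_vg -wi-ao---- 77.00g
$ df -h /var/lib/dblab/data
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/dblab_vg-pg_lv  76G   57M   72G   1% /var/lib/dblab/data
```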
>The logical volume size needs to be defined at volume creation time. By default, we allocate 10% of the available space. If the snapshot usage exceeds the allocated space, the volume will be destroyed, potentially leading to data loss. To prevent volumes from being destroyed, consider enabling the LVM auto-extend feature.
To enable the auto-extend feature, the following LVM configuration options need to be updated:
- `snapshot_autoextend_threshold`: auto-extend a "snapshot" volume when its usage exceeds the specified percentage;
- `snapshot_autoextend_percent`: auto-extend a "snapshot" volume by the specified percentage of the available space once the usage exceeds the threshold.
Update LVM configuration (located in `/etc/lvm/lvm.conf` by default):
```bash
sudo sed -i 's/snapshot_autoextend_threshold.*/snapshot_autoextend_threshold = 70/g' /etc/lvm/lvm.conf
sudo sed -i 's/snapshot_autoextend_percent.*/snapshot_autoextend_percent = 20/g' /etc/lvm/lvm.conf
```
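You can verify the effective values with `lvmconfig`, which ships with the lvm2 package:

```bash
# Print the current values of both auto-extend settings.
sudo lvmconfig activation/snapshot_autoextend_threshold activation/snapshot_autoextend_percent
```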
<!--END_DOCUSAURUS_CODE_TABS-->
## Related
- [Data sources](/docs/guides/data)
- [Database Lab tutorial for Amazon RDS](/docs/tutorials/database-lab-tutorial-amazon-rds)
- [Database Lab tutorial (generic)](/docs/tutorials/database-lab-tutorial)
[↵ Back to Administration guides](/docs/guides/administration)
---
title: Install and initialize Database Lab CLI
title: How to install and initialize Database Lab CLI
---
[↵ Back to CLI guides](/docs/guides/cli)
......
---
title: CLI guides
sidebar_label: CLI
title: How to work with Database Lab CLI
sidebar_label: Database Lab CLI
---
- [CLI install and init](/docs/guides/cli/cli-install-init)
- [How to install and initialize Database Lab CLI](/docs/guides/cli/cli-install-init)
[↵ Back to Guides](/docs/guides/)
---
title: Clone protection from manual and automatic deletion
title: Protect clones from manual and automatic deletion
---
[↵ Back to Cloning guides](/docs/guides/cloning)
......
---
title: Connect to a Database Lab clone
title: How to connect to Database Lab clones
---
[↵ Back to Cloning guides](/docs/guides/cloning)
......
---
title: Create a Database Lab clone
title: How to create Database Lab clones
---
[↵ Back to Cloning guides](/docs/guides/cloning)
......
---
title: Destroy a Database Lab clone
title: How to destroy a Database Lab clone
---
[↵ Back to Cloning guides](/docs/guides/cloning)
......
---
title: Cloning guides
sidebar_label: Cloning
title: How to work with Database Lab clones
sidebar_label: How to work with clones
---
- [Create a Database Lab clone](/docs/guides/cloning/create-clone)
- [Connect to a Database Lab clone](/docs/guides/cloning/connect-clone)
- [Resetting a Database Lab clone state](/docs/guides/cloning/reset-clone)
- [Destroy a Database Lab clone](/docs/guides/cloning/destroy-clone)
- [Clone protection from manual and automatic deletion](/docs/guides/cloning/clone-protection)
- [How to create Database Lab clones](/docs/guides/cloning/create-clone)
- [How to connect to Database Lab clones](/docs/guides/cloning/connect-clone)
- [How to reset a Database Lab clone](/docs/guides/cloning/reset-clone)
- [How to destroy a Database Lab clone](/docs/guides/cloning/destroy-clone)
- [Protect clones from manual and automatic deletion](/docs/guides/cloning/clone-protection)
[↵ Back to Guides](/docs/guides/)
---
title: Resetting a Database Lab clone state
title: How to reset a Database Lab clone
---
[↵ Back to Cloning guides](/docs/guides/cloning)
......
---
title: "Data source: Custom"
sidebar_label: "Data source: Custom"
---
[↵ Back to Data source guides](/docs/guides/data)
>You need to set up a machine for the Database Lab Engine first. Check the [Set up a machine for the Database Lab Engine](/docs/guides/administration/machine-setup) guide for details.
## Configuration
With this data source type, you can use any PostgreSQL backup tool (e.g., pg_basebackup, Barman, pgBackRest) to transfer the data to the Database Lab Engine instance.
### Jobs
To set it up, you need to use the following jobs:
- [physicalRestore](/docs/database-lab/config-reference#job-physicalrestore)
- [physicalSnapshot](/docs/database-lab/config-reference#job-physicalsnapshot)
### Options
Copy the contents of the configuration example [`config.example.physical_generic.yml`](https://gitlab.com/postgres-ai/database-lab/-/blob/master/configs/config.example.physical_generic.yml) from the Database Lab repository to `~/.dblab/server.yml`. For demo purposes, we've made the example based on the `pg_basebackup` tool, but you can use any tool suitable for the task. Check and update the following options:
- Set a secure `server:verificationToken`; it will be used to authorize API requests to the Engine;
- Set connection options in `physicalRestore:options:envs`, based on your tool;
- Set PostgreSQL commands in `physicalRestore:options:customTool` (see the sketch after this list):
  - `command`: defines the command to restore data using a custom tool;
  - `restore_command`: defines the PostgreSQL `restore_command` configuration option used to refresh data;
- Set the proper version in the Postgres Docker image tags (change the images themselves only if you know what you are doing):
  - `provision:options:dockerImage`;
  - `retrieval:spec:physicalRestore:options:dockerImage`;
  - `retrieval:spec:physicalSnapshot:options:dockerImage`.
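For illustration, if `pg_basebackup` is used as the custom tool (as in the example config), the two commands might look like the following; the hostname, user, target path, and WAL fetch tool are assumptions, not values taken from the example:

```bash
# Hypothetical customTool "command": perform the initial physical copy
# of the data directory from the source server.
pg_basebackup --host=source.example.com --port=5432 --username=replication_user \
  --pgdata=/var/lib/dblab/data --wal-method=stream --checkpoint=fast

# Hypothetical "restore_command" (standard PostgreSQL recovery syntax) used
# to fetch archived WAL when refreshing data; WAL-G is just one example.
wal-g wal-fetch %f %p
```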
## Run Database Lab Engine
```bash
sudo docker run \
--name dblab_server \
--label dblab_control \
--privileged \
--publish 2345:2345 \
--volume /var/run/docker.sock:/var/run/docker.sock \
--volume /var/lib/dblab:/var/lib/dblab/:rshared \
--volume ~/.dblab/server.yml:/home/dblab/configs/config.yml \
--env DOCKER_API_VERSION=1.39 \
--detach \
--restart on-failure \
postgresai/dblab-server:2.0.0-beta.2
```
## Restart in the case of failure
```bash
TBD
```
[↵ Back to Data source guides](/docs/guides/data)
---
title: "Data source: dump"
sidebar_label: "Data source: dump"
---
[↵ Back to Data source guides](/docs/guides/data)
>You need to set up a machine for the Database Lab Engine first. Check the [Set up a machine for the Database Lab Engine](/docs/guides/administration/machine-setup) guide for details.
## Configuration
### Jobs
In order to set up Database Lab Engine to automatically get the data from the database using [dump/restore](https://www.postgresql.org/docs/current/app-pgdump.html), you need to use the following jobs:
- [logicalDump](/docs/database-lab/config-reference#job-logicaldump)
- [logicalRestore](/docs/database-lab/config-reference#job-logicalrestore)
- [logicalSnapshot](/docs/database-lab/config-reference#job-logicalsnapshot)
### Options
Copy the contents of the configuration example [`config.example.logical_generic.yml`](https://gitlab.com/postgres-ai/database-lab/-/blob/master/configs/config.example.logical_generic.yml) from the Database Lab repository to `~/.dblab/server.yml` and update the following options:
- Set a secure `server:verificationToken`; it will be used to authorize API requests to the Engine;
- Set connection options in `retrieval:spec:logicalDump:options:source:connection` (you can sanity-check them with the snippet after this list):
  - `dbname`: database name to connect to;
  - `host`: database server host;
  - `port`: database server port;
  - `username`: database user name;
  - `password`: database password (can also be set via the `PGPASSWORD` environment variable of the Docker container);
- Set the proper version in the Postgres Docker image tags (change the images themselves only if you know what you are doing):
  - `provision:options:dockerImage`;
  - `retrieval:spec:logicalRestore:options:dockerImage`;
  - `retrieval:spec:logicalDump:options:dockerImage`.
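Before starting the Engine, you can sanity-check the connection options from the Database Lab machine using `psql` (all values below are placeholders):

```bash
# Placeholders: use the same values you put into ~/.dblab/server.yml.
PGPASSWORD="change_me" psql \
  --host=source.example.com \
  --port=5432 \
  --username=dblab_user \
  --dbname=mydb \
  --command='SELECT version();'
```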
## Run Database Lab Engine
```bash
sudo docker run \
--name dblab_server \
--label dblab_control \
--privileged \
--publish 2345:2345 \
--volume /var/run/docker.sock:/var/run/docker.sock \
--volume /var/lib/dblab/db.dump:/var/lib/dblab/db.dump \
--volume /var/lib/dblab:/var/lib/dblab/:rshared \
--volume ~/.dblab/server.yml:/home/dblab/configs/config.yml \
--env DOCKER_API_VERSION=1.39 \
--detach \
--restart on-failure \
postgresai/dblab-server:2.0.0-beta.2
```
You can also use the `PGPASSWORD` environment variable of the container to set the password (add `--env PGPASSWORD=...` to the command above) instead of keeping it in the configuration file.
## Restart the Engine in the case of failure
```bash
# Stop and remove the Database Lab Engine control container.
sudo docker rm -f dblab_server
# Clean up the data directory.
sudo rm -rf /var/lib/dblab/data/*
# Unmount and remove the dump file.
sudo umount /var/lib/dblab/db.dump
sudo rm -rf /var/lib/dblab/db.dump
```
[↵ Back to Data source guides](/docs/guides/data)
---
title: Database Lab data sources
sidebar_label: Data sources
---
## Guides
### Logical
- [Dump](/docs/guides/data/dump)
- [RDS](/docs/guides/data/rds)
### Physical
- [Custom](/docs/guides/data/custom)
- [WAL-G](/docs/guides/data/wal-g)
- [pg_basebackup](/docs/guides/data/pg_basebackup)
## Overview
To start using cloning, you need to transfer the data to the Database Lab Engine machine first. Data retrieval can also be considered "thick" cloning. Once it is done, users can use "thin" cloning to get independent full-size clones of the database in seconds, for testing and development. Normally, retrieval (thick cloning) is a slow operation (1 TiB/h is a good speed). Optionally, the Database Lab data directory can be kept in sync with the source (continuously updated).
>Read how you can protect PII in the [Data masking](/docs/database-lab/masking) article.
## Data retrieval types
### Logical
Use [dump/restore](https://www.postgresql.org/docs/current/app-pgdump.html) processes, obtaining a logical copy of the initial database (as a set of SQL commands), and then loading it into the target Database Lab data directory. This is the only option for managed cloud PostgreSQL services such as Amazon RDS.
Physically, a copy of the database created using this method differs from the original one (data blocks are stored differently). However, row counts are the same, as are internal database statistics, allowing you to do various kinds of development and testing, including running `EXPLAIN` to optimize SQL queries.
### Physical
Physically copy the data directory from the source (or from the archive if a physical backup tool such as WAL-G, pgBackRest, or Barman is used).
This approach gives you a physically identical copy of the original database, including the existing bloat and the location of data blocks. It is not supported for managed cloud Postgres services such as Amazon RDS.
[↵ Back to Guides](/docs/guides/)
---
title: "Data source: pg_basebackup"
sidebar_label: "Data source: pg_basebackup"
---
[↵ Back to Data source guides](/docs/guides/data)
>You need to set up a machine for the Database Lab Engine first. Check the [Set up a machine for the Database Lab Engine](/docs/guides/administration/machine-setup) guide for details.
## Configuration
### Jobs
In order to set up Database Lab Engine to automatically get the data from the database using [pg_basebackup](https://www.postgresql.org/docs/current/app-pgbasebackup.html), you need to use the following jobs:
- [physicalRestore](/docs/database-lab/config-reference#job-physicalrestore)
- [physicalSnapshot](/docs/database-lab/config-reference#job-physicalsnapshot)
### Options
Copy the contents of the configuration example [`config.example.physical_generic.yml`](https://gitlab.com/postgres-ai/database-lab/-/blob/master/configs/config.example.physical_generic.yml) from the Database Lab repository to `~/.dblab/server.yml` and update the following options:
- Set a secure `server:verificationToken`; it will be used to authorize API requests to the Engine;
- Set connection options in `physicalRestore:options:envs`: