Automated data retrieval (declarative database lab initialization)

Goal

TODO / How to implement

Config example

retrieval:
  stages:
    - initialize

  spec:
    initialize:
      jobs:
#        - name: logical-restore
#          options:
#            dumpFile: /tmp/db.dump
#            forceInit: false
#            dbName: test
#            partial:
#              tables:
#                - test
        - name: physical-restore
          options:
            tool: walg
            dockerImage: "postgresai/sync-instance:12"
            envs:
              WALG_GS_PREFIX: "gs://{BUCKET}/{SCOPE}"
            walg:
              storage: gcs
              backupName: LATEST
              credentialsFile: /tmp/sa.json # optional
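
A minimal sketch of how such a config could be turned into an ordered job plan. It assumes the YAML above has already been parsed into a plain dictionary; `CONFIG` and `plan_jobs` are hypothetical names used for illustration and are not part of Database Lab.

```python
# Hypothetical in-memory form of the config above, as a YAML parser would produce it.
CONFIG = {
    "retrieval": {
        "stages": ["initialize"],
        "spec": {
            "initialize": {
                "jobs": [
                    {
                        "name": "physical-restore",
                        "options": {
                            "tool": "walg",
                            "dockerImage": "postgresai/sync-instance:12",
                            "envs": {"WALG_GS_PREFIX": "gs://{BUCKET}/{SCOPE}"},
                            "walg": {
                                "storage": "gcs",
                                "backupName": "LATEST",
                                "credentialsFile": "/tmp/sa.json",
                            },
                        },
                    }
                ]
            }
        },
    }
}


def plan_jobs(config):
    """Return (stage, job name) pairs in execution order.

    Raises ValueError when a stage is listed under `stages` but has no
    entry under `spec` (an assumed validation rule, not a documented one).
    """
    retrieval = config["retrieval"]
    plan = []
    for stage in retrieval["stages"]:
        if stage not in retrieval["spec"]:
            raise ValueError(f"stage {stage!r} is listed but has no spec")
        for job in retrieval["spec"][stage].get("jobs", []):
            plan.append((stage, job["name"]))
    return plan
```

Calling `plan_jobs(CONFIG)` yields a single `("initialize", "physical-restore")` pair, because the logical-restore job is commented out in the example.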

---- OLD Implementation:

Stage interface (data retrieval, promotion, mask, etc.) 5h
- Clones/snapshot usage
- Docker container provision
- Each stage runs in a separate container
Pipeline:
- [dump/restore | WAL-G | barman -> PGDATA] -> [PGDATA master/replica -> master -> remove PII -> snapshot]
- [import] -> [promote] -> [mask]
- Stage 1: [import]
  - Configuration 2h
    - snapshot TTL
    - mode
    - dockerImage
    - dump/restore
      - connection params
      - formats: plain-text (-Fp), directory (-Fd), custom (-Fc)
  - Experiments/Preparations 6h
    - Massive diff (!)
      - Delete previous clone snapshots
  - Logic 6h
    - dump/restore
    - set statement_timeout to 0 (disables the timeout)
- Stage 2: [promote]
  - Optional, since we want to let SREs promote their clone manually.
  - Configuration 1h
    - dockerImage
  - Logic 6h
- Stage 3: [mask] OUT-OF-SCOPE
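
The stage list above can be sketched as a small pipeline runner. This is an illustration under stated assumptions, not the planned implementation: `Stage`, `run_pipeline`, and the two placeholder stage functions are hypothetical, and each `run` callable stands in for work that would actually happen in a separate Docker container.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    """One pipeline step; `enabled` models optional stages such as promote."""
    name: str
    run: Callable[[Dict[str, str]], None]
    enabled: bool = True


def run_pipeline(stages: List[Stage], context: Dict[str, str]) -> List[str]:
    """Run enabled stages in order and return the names of those executed."""
    executed = []
    for stage in stages:
        if not stage.enabled:
            continue  # e.g. promotion left to the SRE
        stage.run(context)
        executed.append(stage.name)
    return executed


# Placeholder stage bodies (hypothetical): they only record state in the context.
def import_data(context: Dict[str, str]) -> None:
    context["pgdata"] = "restored"


def promote(context: Dict[str, str]) -> None:
    context["role"] = "master"


if __name__ == "__main__":
    ctx: Dict[str, str] = {}
    stages = [Stage("import", import_data), Stage("promote", promote, enabled=False)]
    print(run_pipeline(stages, ctx), ctx)
```

Keeping `enabled` on the stage itself means a skipped promote stage needs no special handling in the runner.
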
Pipeline scheduling (run data retrieval on interval) 4h
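
One possible shape for interval-based scheduling, as a sketch only. `run_on_interval` and its parameters are hypothetical; the `sleep` argument is injectable so the loop can be exercised without waiting, and `max_runs` exists only to make the example terminate.

```python
import time


def run_on_interval(task, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call `task` repeatedly, aiming for one start every `interval_seconds`.

    The time spent inside `task` is subtracted from the pause, so a long
    retrieval does not push later runs further and further back.
    Returns the number of completed runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        task()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break  # no pause needed after the final run
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_seconds - elapsed))
    return runs
```

For example, `run_on_interval(retrieve, 3600)` would start a hypothetical `retrieve` function roughly once an hour until the process is stopped.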

Documentation:

Notify users about autovacuum pause 1h
Pipelines docs 6h
- Configs
- How it works
- How to extend

Acceptance criteria

Edited Jul 14, 2020 by Artyom Kartasov