Database Lab - Telemetry
Goal
As a Database Lab user, I want to be able to collect performance metrics about a clone (e.g. for verifying database migrations).
Implement basic support with collection of the following artefacts:
- Telemetry collection duration;
- Locks monitoring;
- CI/CD: how can we understand if it fails?
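A minimal sketch of how the locks artefact could be aggregated, assuming periodic snapshots of `pg_locks` are sampled while a migration runs. The row structure and key names here are hypothetical, not part of Database Lab:

```python
from collections import Counter

def summarize_lock_samples(samples):
    """Aggregate periodic pg_locks snapshots into a simple artefact.

    `samples` is a list of snapshots; each snapshot is a list of dicts
    with hypothetical keys: "mode" (lock mode) and "granted" (bool).
    """
    waiting = Counter()  # lock mode -> how often a lock was seen ungranted
    peak_waiting = 0     # max number of ungranted locks in a single snapshot
    for snapshot in samples:
        ungranted = [row for row in snapshot if not row["granted"]]
        peak_waiting = max(peak_waiting, len(ungranted))
        for row in ungranted:
            waiting[row["mode"]] += 1
    return {"waiting_by_mode": dict(waiting), "peak_waiting": peak_waiting}

# Example: two snapshots taken during a migration test
samples = [
    [{"mode": "AccessExclusiveLock", "granted": True}],
    [{"mode": "AccessExclusiveLock", "granted": True},
     {"mode": "RowExclusiveLock", "granted": False}],
]
print(summarize_lock_samples(samples))
# {'waiting_by_mode': {'RowExclusiveLock': 1}, 'peak_waiting': 1}
```

A real collector would fill `samples` by querying `pg_catalog.pg_locks` on an interval; the aggregation step stays the same.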
TODO / How to implement
Original idea: new client CLI commands:
- `dblab telemetry start` - starts collection of metrics (deletes previously collected artefacts);
- `dblab telemetry stop` - stops collection of metrics and interval services, executes additional scripts;
- `dblab telemetry status` - gets a short summary of the migration status;
- `dblab telemetry download` - downloads artefacts from the Database Lab machine to the user's machine.
Implement the corresponding API handlers.
Usage:

```shell
dblab clone create ...
dblab telemetry start
sqitch deploy
dblab telemetry stop
dblab telemetry download
```
Suggestions from the standup call:
- Use the Prometheus format for exporting time series data: `dblab clone observe -f CLONE_ID`.
- Use `start`/`stop` or not?
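To make the Prometheus suggestion concrete, here is a sketch of rendering clone metrics in the Prometheus text exposition format. The metric names are hypothetical, not existing Database Lab metrics:

```python
def to_prometheus_text(metrics):
    """Render metrics in the Prometheus text exposition format.

    `metrics` maps (metric_name, labels) to a numeric value, where
    `labels` is a tuple of (key, value) pairs.
    """
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical metric names for a clone being observed
metrics = {
    ("dblab_clone_observe_duration_seconds", (("clone_id", "c1"),)): 42.5,
    ("dblab_clone_lock_waits_total", (("clone_id", "c1"),)): 3,
}
print(to_prometheus_text(metrics))
```

In practice, postgres_exporter (linked below) already exposes most per-database metrics in this format, which is why reusing it is one of the options.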
Improved idea:
- Collect telemetry and expose it in Prometheus format, or use https://github.com/wrouesnel/postgres_exporter.
- Collect telemetry all the time for all clones.
- Use `dblab clone observe -f CLONE_ID` to retrieve current telemetry for the clone.
- Use `dblab clone start-observe CLONE_ID` and `dblab clone stop-observe CLONE_ID` to get aggregated statistics about duration and locks at the end of a DB migration test.
Fastest implementation:
- Execute monitoring queries from the Database Lab CLI itself; do not change the server or its API.
- Use `dblab clone observe -f CLONE_ID` to start observing and `dblab clone stop-observe CLONE_ID` to stop observing and show aggregated metrics.
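The fastest variant, polling monitoring queries from the client side only, could be sketched as follows. The class and all names are illustrative, and the query runner is stubbed out; a real CLI would run SQL (e.g. against `pg_locks`) on the clone:

```python
import time

class CloneObserver:
    """Poll a monitoring query from the client and aggregate the results,
    without changing the Database Lab server or its API."""

    def __init__(self, run_query):
        self.run_query = run_query  # callable returning waiting-lock count
        self.samples = []
        self.started_at = None

    def start(self):
        self.samples = []           # drop previously collected artefacts
        self.started_at = time.monotonic()

    def poll(self):                 # called on an interval while observing
        self.samples.append(self.run_query())

    def stop(self):                 # returns the aggregated metrics
        return {
            "duration_seconds": time.monotonic() - self.started_at,
            "polls": len(self.samples),
            "max_waiting_locks": max(self.samples, default=0),
        }

# Usage with a stubbed query returning 0, 2, 1 waiting locks in turn
obs = CloneObserver(run_query=iter([0, 2, 1]).__next__)
obs.start()
for _ in range(3):
    obs.poll()
summary = obs.stop()
print(summary["polls"], summary["max_waiting_locks"])  # 3 2
```

This keeps all state in the CLI process, which matches the constraint above: no server or API changes are needed.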
Acceptance criteria
Edited by Anatoly Stansler