Basic graphs helping to determine bottlenecks
Goal
The set of artifacts currently delivered at the end of each experiment lacks answers to the following questions:
- What was the resource consumption (CPU, memory, IO, etc.)?
- What was the main bottleneck (CPU / IO / network / etc.)?
- When was a resource's consumption higher, and when was it lower? (historical data)
To mitigate this, we need to add graphs. Roadmap:
- CPU. This metric is a major one. We need info about each vCPU/core – this is very important for detecting bottlenecks.
  - find a way to collect it / discuss / decide
  - implement
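One option for per-core data is parsing the text output of `mpstat -P ALL` (from the sysstat package). A minimal sketch, assuming sysstat's default column layout; the sample output and numbers below are illustrative, not Nancy's actual artifact format:

```python
# Hypothetical sample of `mpstat -P ALL` output (truncated for illustration).
SAMPLE = """\
12:00:01 AM  CPU    %usr   %nice    %sys %iowait   %idle
12:00:01 AM  all    40.00   0.00    5.00    2.00   53.00
12:00:01 AM    0    99.00   0.00    1.00    0.00    0.00
12:00:01 AM    1    10.00   0.00    2.00    0.00   88.00
"""

def parse_mpstat(text):
    """Return {cpu_label: busy_percent}, where busy = 100 - %idle."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] == "CPU":  # skip blank/header lines
            continue
        cpu, idle = fields[2], float(fields[-1])
        result[cpu] = round(100.0 - idle, 2)
    return result

per_core = parse_mpstat(SAMPLE)
# One core is pegged at 100% while the average is ~47% – a classic
# single-core bottleneck that an "all CPUs" average would hide.
print(per_core)  # {'all': 47.0, '0': 100.0, '1': 12.0}
```

This illustrates exactly why per-core data matters: the `all` average looks healthy while core 0 is saturated.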
- Memory. This metric is a major one. But not just "how much was consumed" – we need more.
  - define how to get a much deeper look at memory (/proc/meminfo? vm.dirty_* sysctl settings? see Тюрин_pgconf19.pdf; what else?)
  - implement
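The "deeper look" could start with periodic snapshots of /proc/meminfo, which exposes fields such as Dirty, Writeback, Buffers, Cached, and AnonPages. A minimal parsing sketch; the sample below is a truncated, made-up illustration of the real file's format:

```python
# Hypothetical, truncated /proc/meminfo sample (real files have ~50 fields).
SAMPLE = """\
MemTotal:       16316412 kB
MemFree:          837164 kB
Cached:          6412340 kB
Dirty:             12345 kB
Writeback:             0 kB
AnonPages:       5123456 kB
"""

def parse_meminfo(text):
    """Return {field_name: value_in_kB} for each 'Name: value kB' line."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name] = int(parts[0])  # value in kB
    return info

mem = parse_meminfo(SAMPLE)
print(mem["Dirty"], mem["Cached"])  # 12345 6412340
```

On a real Linux host the same function would consume `open("/proc/meminfo").read()`; sampling it every few seconds gives the historical view (e.g. dirty-page growth before a checkpoint).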
- IO. This metric is a major one.
  - at bare minimum we need: read/write IOPS and read/write throughput. What should we use – iotop? Discuss, decide. (Additionally: iostat? queue size, %util?)
  - implement
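If we want the bare minimum without extra tools, IOPS and throughput can be derived from two /proc/diskstats samples. A sketch under the kernel's documented field layout (sectors are 512 bytes); the device name and counter values below are made up:

```python
SECTOR = 512  # /proc/diskstats counts sectors of 512 bytes

def disk_counters(text, dev):
    """Extract (reads, sectors_read, writes, sectors_written) for one device."""
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 10 and f[2] == dev:
            return int(f[3]), int(f[5]), int(f[7]), int(f[9])
    raise KeyError(dev)

def io_rates(before, after, dev, interval):
    """Turn two diskstats snapshots, `interval` seconds apart, into rates."""
    r0, rs0, w0, ws0 = disk_counters(before, dev)
    r1, rs1, w1, ws1 = disk_counters(after, dev)
    return {
        "read_iops": (r1 - r0) / interval,
        "write_iops": (w1 - w0) / interval,
        "read_bps": (rs1 - rs0) * SECTOR / interval,
        "write_bps": (ws1 - ws0) * SECTOR / interval,
    }

# Two hypothetical snapshots of one /proc/diskstats line, 10 seconds apart.
T0 = "   8       0 sda 1000 0 80000 500 2000 0 160000 900 0 0 0\n"
T1 = "   8       0 sda 1100 0 96000 510 2600 0 240000 950 0 0 0\n"
rates = io_rates(T0, T1, "sda", interval=10)
print(rates)
```

In practice iostat already does this computation (plus queue size and %util), so this is mainly a fallback if we decide not to depend on sysstat.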
TODO / How to implement
- collect iostat and mpstat: https://gitlab.com/postgres-ai-team/nancy/issues/203
- graph iostat and mpstat: https://gitlab.com/postgres-ai-team/nancy/issues/207
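A sketch of how the runner could launch these collectors in the background and save their raw output as artifacts. The file names and the 5-second interval are assumptions, not decided values; the iostat/mpstat flags are standard sysstat options (`-x` extended stats, `-m` MB/s, `-t` timestamps, `-P ALL` per-core):

```python
import subprocess

def collector_commands(interval=5):
    """Map artifact file name -> collector command line (names are assumptions)."""
    return {
        "iostat.log": ["iostat", "-xmt", str(interval)],       # extended disk stats
        "mpstat.log": ["mpstat", "-P", "ALL", str(interval)],  # per-core CPU
    }

def start_collectors(artifact_dir, interval=5):
    """Start the collectors; caller terminates the Popen handles at experiment end."""
    procs = []
    for fname, cmd in collector_commands(interval).items():
        out = open(f"{artifact_dir}/{fname}", "w")
        procs.append(subprocess.Popen(cmd, stdout=out, stderr=subprocess.DEVNULL))
    return procs
```

Usage would be `procs = start_collectors("/tmp/artifacts")` before the workload and `for p in procs: p.terminate()` after it, leaving iostat.log/mpstat.log as experiment artifacts.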
- for remote experiments on AWS, CloudWatch (https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/Welcome.html) might be helpful:
  - check its API – can we use it? https://gitlab.com/postgres-ai-team/nancy/issues/204
  - do we need "Enhanced Monitoring"? If so, what is its overhead?
  - implement collection of the main metrics (CPU, RAM, IO) from CloudWatch in the form of artifacts
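For the API check: the relevant call is CloudWatch's GetMetricStatistics (e.g. via boto3's `cloudwatch.get_metric_statistics(**params)`). A sketch of the request we would build – the namespace and metric name are the standard AWS ones, the instance id is a placeholder; note that basic monitoring is 5-minute resolution, so 1-minute datapoints require detailed monitoring:

```python
from datetime import datetime, timedelta

def cpu_metric_params(instance_id, end, minutes=30):
    """Build GetMetricStatistics kwargs for per-minute EC2 CPU utilization."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,                        # seconds per datapoint
        "Statistics": ["Average", "Maximum"],
    }

# Placeholder instance id; in Nancy this would be the experiment's EC2 instance.
params = cpu_metric_params("i-0123456789abcdef0", datetime(2019, 1, 1, 12, 0))
```

Memory is notably absent from the AWS/EC2 namespace (it needs an agent on the instance), which is one concrete thing the API check should confirm.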
- consider: pgmetrics (https://pgmetrics.io/) – can this be helpful for us? Learn it, and define where it can be helpful and where it cannot.
- consider: pgio (https://gitlab.com/ongresinc/pgio)
- consider: [optional] installation/use of okmeter for some cases?
- consider: Prometheus with some exporter for Postgres? https://gitlab.com/postgres-ai-team/nancy/issues/193
- Postgres-specific metrics, historical data (https://gitlab.com/postgres-ai-team/nancy/issues/206):
  - amount of WAL data generated
  - checkpoints: forced and planned; buffer writes from checkpointer, bgwriter, and backends
  - autovacuum
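The three Postgres-specific items above map to standard catalog queries that could be sampled periodically during an experiment. A sketch using the Postgres 10+ function names (`pg_current_wal_lsn` etc.; 9.6 would need the `xlog` variants):

```python
# Hypothetical set of sampling queries; names of the dict keys are assumptions.
QUERIES = {
    # Total WAL bytes generated so far; diff two samples to get the rate.
    "wal_bytes": "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')",
    # Planned (timed) vs forced (requested) checkpoints, and who wrote buffers.
    "checkpoints": (
        "SELECT checkpoints_timed, checkpoints_req, "
        "buffers_checkpoint, buffers_clean, buffers_backend "
        "FROM pg_stat_bgwriter"
    ),
    # Autovacuum activity per table.
    "autovacuum": (
        "SELECT relname, autovacuum_count, last_autovacuum "
        "FROM pg_stat_user_tables ORDER BY autovacuum_count DESC"
    ),
}
```

Sampling these alongside the OS metrics would let the graphs correlate, say, a write-throughput spike with a forced checkpoint.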
- representation (draw the graphs!)
  - consider various options: draw them ourselves (with some libraries involved), or use existing monitoring tools? (e.g. https://blog.timescale.com/grafana-time-series-exploration-visualization-postgresql-8c7baa9c3bfe/)
  - discuss/decide
  - implement: https://gitlab.com/postgres-ai-team/nancy/issues/207
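To make the "draw ourselves vs. existing tools" trade-off concrete: the cheapest self-drawn option is a text sparkline embedded directly in the plain-text artifacts, with no library dependency at all. A toy sketch (a real decision would likely land on a plotting library or Grafana instead):

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block characters, lowest to highest

def sparkline(values):
    """Render a metric time series as a one-line text sparkline."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for flat series
    return "".join(BARS[int((v - lo) / span * (len(BARS) - 1))] for v in values)

cpu = [5, 7, 12, 48, 95, 99, 97, 60, 20, 8]  # made-up CPU% samples
print(sparkline(cpu))  # ▁▁▁▄▇█▇▅▂▁
```

Even this crude view shows the shape of the load over time, which is most of what the acceptance criteria below ask for; the open question is whether we need interactive zooming and overlaying, which argues for Grafana-style tooling.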
Acceptance criteria
As a user of Nancy, I can see:
- what the resource consumption was during the experiment (CPU, memory, disk, network);
- I see it separately for the "initialization" and "workload" stages;
- I see it in a historical perspective (time series, graphs);
- in many cases, all this information helps me understand what the bottleneck was at each concrete moment. Example: I see that one CPU core was at 100% while disk utilization was at 80%; I check perf and see that the majority (>50%) of the time was spent working with the buffer pool's hash table; so I conclude that CPU – specifically, work with shared buffers – was the main bottleneck.