Basic graphs to help determine bottlenecks
The set of artifacts currently delivered at the end of each experiment lacks the following:
- What was the resource consumption (CPU, memory, IO, etc.)?
- What was the main bottleneck (CPU / IO / network / etc.)?
- When was a resource's consumption higher, and when was it lower? (historical data)
To mitigate this, we need to add graphs. Roadmap:
CPU is a major metric. But we need information about each vCPU/core individually – this is very important for detecting bottlenecks.
- find a way to collect it / discuss / decide
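As a starting point, per-core utilization is available even without extra tools. The sketch below (an assumption, not Nancy's actual implementation) derives per-core busy fractions by sampling Linux's `/proc/stat` twice; each `cpuN` line there holds jiffy counters in the order `user nice system idle iowait irq softirq steal ...`:

```python
def read_cpu_times(stat_text):
    """Parse per-core jiffy counters from /proc/stat content."""
    times = {}
    for line in stat_text.splitlines():
        if line.startswith("cpu") and line[3:4].isdigit():
            fields = line.split()
            times[fields[0]] = [int(x) for x in fields[1:]]
    return times

def utilization(before, after):
    """Per-core busy fraction between two /proc/stat samples."""
    util = {}
    for cpu, t0 in before.items():
        t1 = after[cpu]
        total = sum(t1) - sum(t0)
        idle = (t1[3] + t1[4]) - (t0[3] + t0[4])  # idle + iowait columns
        util[cpu] = 1.0 - idle / total if total else 0.0
    return util

# Two synthetic samples (in real use: read /proc/stat, sleep, read again)
a = read_cpu_times("cpu0 100 0 100 800 0 0 0 0\ncpu1 50 0 50 900 0 0 0 0\n")
b = read_cpu_times("cpu0 200 0 200 1600 0 0 0 0\ncpu1 60 0 60 1880 0 0 0 0\n")
print(utilization(a, b))  # cpu0 busy 20%, cpu1 busy 2%
```

`mpstat -P ALL` reports the same per-core data in a ready-made form, which is likely the easier route for artifact collection.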
Memory is a major metric. But not just "how much was consumed" – we need more.
- define how to get a much deeper look at memory (vm.dirty_*? see Тюрин_pgconf19.pdf; what else?)
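One candidate source for that "deeper look" is `/proc/meminfo`, which exposes dirty-page and writeback state in addition to plain used/free numbers. A minimal parsing sketch (an assumption, not a settled decision):

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as {name: kilobytes}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if rest:
            info[name.strip()] = int(rest.split()[0])
    return info

# Synthetic sample; in real use: open("/proc/meminfo").read()
sample = """MemTotal:       16384000 kB
MemFree:         1024000 kB
Cached:          8192000 kB
Dirty:             20480 kB
Writeback:           512 kB"""
snap = parse_meminfo(sample)
print(snap["Dirty"])  # kB of dirty pages waiting for writeback
```

Sampling `Dirty`/`Writeback` over time would pair naturally with the `vm.dirty_*` sysctl settings when analyzing write stalls.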
IO is a major metric.
- At bare minimum we need: read/write IOPS, read/write throughput. What should we use – iotop? Discuss, decide. (Additionally: iostat? queue size, %util?)
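If iostat is chosen, the minimum metrics can be pulled out of its extended device report. The sketch below is an assumption: it maps an `iostat -dxk` header to a device line generically, since exact column names and order vary across sysstat versions:

```python
def parse_iostat_device(header, line):
    """Map an iostat extended device report header onto one device line."""
    row = dict(zip(header.split(), line.split()))
    return {
        "device": row["Device"],
        "read_iops": float(row["r/s"]),
        "write_iops": float(row["w/s"]),
        "read_kBps": float(row["rkB/s"]),
        "write_kBps": float(row["wkB/s"]),
        "util_pct": float(row["%util"]),
    }

# Synthetic sample matching a typical `iostat -dxk` layout
header = "Device r/s w/s rkB/s wkB/s %util"
line = "nvme0n1 120.0 80.0 4800.0 3200.0 42.5"
print(parse_iostat_device(header, line))
```

Parsing against the header rather than fixed column positions keeps the collector tolerant of sysstat version differences.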
TODO / How to implement
collect iostat and mpstat https://gitlab.com/postgres-ai-team/nancy/issues/203
graph iostat and mpstat https://gitlab.com/postgres-ai-team/nancy/issues/207
for remote experiments on AWS, CloudWatch https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/Welcome.html might be helpful
- check their API - could we use it? https://gitlab.com/postgres-ai-team/nancy/issues/204
- do we need "Enhanced Monitoring"? If so, what is its overhead?
- implement collection of main (CPU, RAM, IO) metrics from CloudWatch in form of artifacts
consider: pgmetrics https://pgmetrics.io/ – can this be helpful for us? Study it and define where it can be helpful and where it cannot.
consider: pgio https://gitlab.com/ongresinc/pgio
consider: [optional] installation/use of okmeter for some cases?
consider: prometheus with some exporter for Postgres?
Postgres-specific metrics, historical data:
- amount of WAL data generated
- checkpoints. Forced, planned. Writes from checkpointer, bgwriter, backends.
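For the WAL metric, the total amount generated during a run can be derived from `pg_current_wal_lsn()` (PostgreSQL 10+) sampled before and after the experiment. An LSN such as `2A/16B3748` encodes a 64-bit position: the hex part before the slash is the high 32 bits, the part after is the low 32 bits. A sketch of the arithmetic (Postgres itself can also do this server-side via `pg_wal_lsn_diff()`):

```python
def lsn_to_bytes(lsn):
    """Convert a Postgres LSN string like '2A/16B3748' to a byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def wal_generated(lsn_start, lsn_end):
    """Bytes of WAL written between two LSN snapshots."""
    return lsn_to_bytes(lsn_end) - lsn_to_bytes(lsn_start)

print(wal_generated("0/1000000", "0/3000000"))  # 33554432 bytes = 32 MiB
```

Checkpoint and bgwriter/backend write counters are similarly available as before/after snapshots of `pg_stat_bgwriter`.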
representation (draw graphs!)
- consider various options: "draw it ourselves" (with some libraries involved), or use existing monitoring tools? (e.g. https://blog.timescale.com/grafana-time-series-exploration-visualization-postgresql-8c7baa9c3bfe/)
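Whichever representation we pick, storing collected samples as plain CSV time series keeps the choice open: such files plot directly with gnuplot, load into Grafana, or feed matplotlib later. A minimal sketch of the artifact format (an assumption, not a decided design):

```python
import csv
import io

def samples_to_csv(samples, fields):
    """Serialize a list of metric dicts (one per timestamp) to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(samples)
    return buf.getvalue()

rows = [{"ts": 0, "read_iops": 120.0},
        {"ts": 1, "read_iops": 95.0}]
print(samples_to_csv(rows, ["ts", "read_iops"]))
```

One CSV per resource (CPU, memory, IO), delivered alongside the other experiment artifacts, would make the historical data trivially graphable.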
As a user of Nancy, I can see:
- what the resource consumption was during the experiment (CPU, memory, disk, network)
- I see it separately for the "initialization" and "workload" stages
- I see it in historical perspective (time series, graphs)
- in many cases, all this information helps me understand what the bottleneck was at each concrete moment (example: I see that one CPU core was at 100% while disk utilization was 80%; I then check perf and see that the majority – more than 50% – of the time was spent working with the buffer pool's hash table, so I conclude that CPU work with shared buffers was the main bottleneck).