Basic graphs helping to determine bottlenecks
Goal
The set of artifacts currently delivered at the end of each experiment lacks answers to the following questions:
- What was the resource consumption (CPU, memory, IO, etc.)?
- What was the main bottleneck (CPU / IO / network / etc.)?
- When was a resource's consumption higher, and when was it lower? (historical data)
To mitigate this, we need to add graphs. Roadmap:
- CPU. This metric is a major one. We need info about each vCPU/core – this is very important for detecting bottlenecks.
  - find a way to collect it / discuss / decide
  - implement
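One option for per-core data is parsing the text output of `mpstat -P ALL` (from the sysstat package). A minimal sketch, assuming sysstat's default column layout; the sample output and numbers below are illustrative, not Nancy's actual artifact format:

```python
# Hypothetical sample of `mpstat -P ALL` output (truncated for illustration).
SAMPLE = """\
12:00:01 AM  CPU    %usr   %nice    %sys %iowait   %idle
12:00:01 AM  all    40.00   0.00    5.00    2.00   53.00
12:00:01 AM    0    99.00   0.00    1.00    0.00    0.00
12:00:01 AM    1    10.00   0.00    2.00    0.00   88.00
"""

def parse_mpstat(text):
    """Return {cpu_label: busy_percent}, where busy = 100 - %idle."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] == "CPU":  # skip blank/header lines
            continue
        cpu, idle = fields[2], float(fields[-1])
        result[cpu] = round(100.0 - idle, 2)
    return result

per_core = parse_mpstat(SAMPLE)
# One core is pegged at 100% while the average is ~47% – a classic
# single-core bottleneck that an "all CPUs" average would hide.
print(per_core)  # {'all': 47.0, '0': 100.0, '1': 12.0}
```

This illustrates exactly why per-core data matters: the `all` average looks healthy while core 0 is saturated.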
- Memory. This metric is a major one. But not just "how much was consumed" – we need more.
  - define how to get a much deeper look at memory (/proc/meminfo? vm.dirty_* sysctl settings? see Тюрин_pgconf19.pdf; what else?)
  - implement
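The "deeper look" could start with periodic snapshots of /proc/meminfo, which exposes fields such as Dirty, Writeback, Buffers, Cached, and AnonPages. A minimal parsing sketch; the sample below is a truncated, made-up illustration of the real file's format:

```python
# Hypothetical, truncated /proc/meminfo sample (real files have ~50 fields).
SAMPLE = """\
MemTotal:       16316412 kB
MemFree:          837164 kB
Cached:          6412340 kB
Dirty:             12345 kB
Writeback:             0 kB
AnonPages:       5123456 kB
"""

def parse_meminfo(text):
    """Return {field_name: value_in_kB} for each 'Name: value kB' line."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name] = int(parts[0])  # value in kB
    return info

mem = parse_meminfo(SAMPLE)
print(mem["Dirty"], mem["Cached"])  # 12345 6412340
```

On a real Linux host the same function would consume `open("/proc/meminfo").read()`; sampling it every few seconds gives the historical view (e.g. dirty-page growth before a checkpoint).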
- IO. This metric is a major one.
  - at bare minimum we need: read/write IOPS and read/write throughput. What should we use – iotop? Discuss, decide. (Additionally: iostat? queue size, %util?)
  - implement
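If we want the bare minimum without extra tools, IOPS and throughput can be derived from two /proc/diskstats samples. A sketch under the kernel's documented field layout (sectors are 512 bytes); the device name and counter values below are made up:

```python
SECTOR = 512  # /proc/diskstats counts sectors of 512 bytes

def disk_counters(text, dev):
    """Extract (reads, sectors_read, writes, sectors_written) for one device."""
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 10 and f[2] == dev:
            return int(f[3]), int(f[5]), int(f[7]), int(f[9])
    raise KeyError(dev)

def io_rates(before, after, dev, interval):
    """Turn two diskstats snapshots, `interval` seconds apart, into rates."""
    r0, rs0, w0, ws0 = disk_counters(before, dev)
    r1, rs1, w1, ws1 = disk_counters(after, dev)
    return {
        "read_iops": (r1 - r0) / interval,
        "write_iops": (w1 - w0) / interval,
        "read_bps": (rs1 - rs0) * SECTOR / interval,
        "write_bps": (ws1 - ws0) * SECTOR / interval,
    }

# Two hypothetical snapshots of one /proc/diskstats line, 10 seconds apart.
T0 = "   8       0 sda 1000 0 80000 500 2000 0 160000 900 0 0 0\n"
T1 = "   8       0 sda 1100 0 96000 510 2600 0 240000 950 0 0 0\n"
rates = io_rates(T0, T1, "sda", interval=10)
print(rates)
```

In practice iostat already does this computation (plus queue size and %util), so this is mainly a fallback if we decide not to depend on sysstat.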
TODO / How to implement
- collect iostat and mpstat: https://gitlab.com/postgres-ai-team/nancy/issues/203
- graph iostat and mpstat: https://gitlab.com/postgres-ai-team/nancy/issues/207
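A sketch of how the runner could launch these collectors in the background and save their raw output as artifacts. The file names and the 5-second interval are assumptions, not decided values; the iostat/mpstat flags are standard sysstat options (`-x` extended stats, `-m` MB/s, `-t` timestamps, `-P ALL` per-core):

```python
import subprocess

def collector_commands(interval=5):
    """Map artifact file name -> collector command line (names are assumptions)."""
    return {
        "iostat.log": ["iostat", "-xmt", str(interval)],       # extended disk stats
        "mpstat.log": ["mpstat", "-P", "ALL", str(interval)],  # per-core CPU
    }

def start_collectors(artifact_dir, interval=5):
    """Start the collectors; caller terminates the Popen handles at experiment end."""
    procs = []
    for fname, cmd in collector_commands(interval).items():
        out = open(f"{artifact_dir}/{fname}", "w")
        procs.append(subprocess.Popen(cmd, stdout=out, stderr=subprocess.DEVNULL))
    return procs
```

Usage would be `procs = start_collectors("/tmp/artifacts")` before the workload and `for p in procs: p.terminate()` after it, leaving iostat.log/mpstat.log as experiment artifacts.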
- for remote experiments on AWS, CloudWatch (https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/Welcome.html) might be helpful:
  - check its API – can we use it? https://gitlab.com/postgres-ai-team/nancy/issues/204
  - do we need "Enhanced Monitoring"? If so, what is its overhead?
  - implement collection of the main metrics (CPU, RAM, IO) from CloudWatch in the form of artifacts
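For the API check: the relevant call is CloudWatch's GetMetricStatistics (e.g. via boto3's `cloudwatch.get_metric_statistics(**params)`). A sketch of the request we would build – the namespace and metric name are the standard AWS ones, the instance id is a placeholder; note that basic monitoring is 5-minute resolution, so 1-minute datapoints require detailed monitoring:

```python
from datetime import datetime, timedelta

def cpu_metric_params(instance_id, end, minutes=30):
    """Build GetMetricStatistics kwargs for per-minute EC2 CPU utilization."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,                        # seconds per datapoint
        "Statistics": ["Average", "Maximum"],
    }

# Placeholder instance id; in Nancy this would be the experiment's EC2 instance.
params = cpu_metric_params("i-0123456789abcdef0", datetime(2019, 1, 1, 12, 0))
```

Memory is notably absent from the AWS/EC2 namespace (it needs an agent on the instance), which is one concrete thing the API check should confirm.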
- consider: pgmetrics (https://pgmetrics.io/) – can this be helpful for us? Learn it, and define where it can be helpful and where it cannot.
- consider: pgio (https://gitlab.com/ongresinc/pgio)
- consider: [optional] installation/use of okmeter for some cases?
- consider: Prometheus with some exporter for Postgres? https://gitlab.com/postgres-ai-team/nancy/issues/193
- Postgres-specific metrics, historical data (https://gitlab.com/postgres-ai-team/nancy/issues/206):
  - amount of WAL data generated
  - checkpoints: forced and planned; buffer writes from checkpointer, bgwriter, and backends
  - autovacuum
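The three Postgres-specific items above map to standard catalog queries that could be sampled periodically during an experiment. A sketch using the Postgres 10+ function names (`pg_current_wal_lsn` etc.; 9.6 would need the `xlog` variants):

```python
# Hypothetical set of sampling queries; names of the dict keys are assumptions.
QUERIES = {
    # Total WAL bytes generated so far; diff two samples to get the rate.
    "wal_bytes": "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')",
    # Planned (timed) vs forced (requested) checkpoints, and who wrote buffers.
    "checkpoints": (
        "SELECT checkpoints_timed, checkpoints_req, "
        "buffers_checkpoint, buffers_clean, buffers_backend "
        "FROM pg_stat_bgwriter"
    ),
    # Autovacuum activity per table.
    "autovacuum": (
        "SELECT relname, autovacuum_count, last_autovacuum "
        "FROM pg_stat_user_tables ORDER BY autovacuum_count DESC"
    ),
}
```

Sampling these alongside the OS metrics would let the graphs correlate, say, a write-throughput spike with a forced checkpoint.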
- representation (draw the graphs!)
  - consider various options: draw them ourselves (with some libraries involved), or use existing monitoring tools? (e.g. https://blog.timescale.com/grafana-time-series-exploration-visualization-postgresql-8c7baa9c3bfe/)
  - discuss/decide
  - implement: https://gitlab.com/postgres-ai-team/nancy/issues/207
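To make the "draw ourselves vs. existing tools" trade-off concrete: the cheapest self-drawn option is a text sparkline embedded directly in the plain-text artifacts, with no library dependency at all. A toy sketch (a real decision would likely land on a plotting library or Grafana instead):

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block characters, lowest to highest

def sparkline(values):
    """Render a metric time series as a one-line text sparkline."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for flat series
    return "".join(BARS[int((v - lo) / span * (len(BARS) - 1))] for v in values)

cpu = [5, 7, 12, 48, 95, 99, 97, 60, 20, 8]  # made-up CPU% samples
print(sparkline(cpu))  # ▁▁▁▄▇█▇▅▂▁
```

Even this crude view shows the shape of the load over time, which is most of what the acceptance criteria below ask for; the open question is whether we need interactive zooming and overlaying, which argues for Grafana-style tooling.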
Acceptance criteria
As a user of Nancy, I can see:
- what the resource consumption was during the experiment (CPU, memory, disk, network);
- I see it separately for the "initialization" and "workload" stages;
- I see it in a historical perspective (time series, graphs);
- in many cases, all this information helps me understand what the bottleneck was at each concrete moment. Example: I see that one CPU core was at 100% while disk utilization was at 80%; I check perf and see that the majority (>50%) of the time was spent working with the buffer pool's hash table; so I conclude that CPU – specifically, work with shared buffers – was the main bottleneck.