0094: How to run ANALYZE (to collect statistics)
@fomin.list @vitabaks @msdousti
- 0094_how_to_run_analyze.md (new file)
# How to run ANALYZE (to collect statistics)

The command `ANALYZE` collects statistics about a database ([docs](https://www.postgresql.org/docs/current/sql-analyze.html)). Maintaining fresh statistics is crucial for achieving good database performance.

Running it is trivial:

```sql
analyze;
```

However, being single-threaded, this can take a lot of time.

## How to run ANALYZE at full speed

To utilize multiple CPU cores, we can use the client program `vacuumdb` with the option `--analyze-only` and multiple workers ([docs](https://www.postgresql.org/docs/current/app-vacuumdb.html)).

The following runs `ANALYZE` on *all* databases (`--all`), using a number of workers matching the number of vCPUs, and limiting the overall duration to 2 hours (connection options such as `-h` and `-U` are not shown here):
```bash
{
while IFS= read -r line; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') $line"
done
} < <(
  PGOPTIONS='-c statement_timeout=2h' \
  vacuumdb \
    --analyze-only \
    --all \
    --jobs $(nproc) \
    --echo
) | tee -a analyze_all_$(date +%Y%m%d).log
```

With this snippet, all the commands will also be printed and logged, with timestamps (alternatively, instead of the `while` loop, one could use `ts` from `moreutils`).

`--jobs $(nproc)` works on Linux and sets the number of workers to match the number of vCPUs. Note that if there are large unpartitioned tables, at some point only a few workers may remain active. A solution to this problem can be partitioning: with many smaller partitions, all workers can remain busy, which can speed up the whole operation drastically on machines with a high number of CPU cores.
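As a quick way to verify the effect of such a run, here is a minimal sketch using the standard `pg_stat_user_tables` view to list the least recently analyzed tables (run it in each database of interest; the `limit` is arbitrary):

```sql
-- when was each table last analyzed, manually or by autovacuum?
select
  schemaname,
  relname,
  last_analyze,
  last_autoanalyze,
  analyze_count,
  autoanalyze_count
from pg_stat_user_tables
-- never-analyzed and least recently analyzed tables first
order by greatest(last_analyze, last_autoanalyze) asc nulls first
limit 20;
```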
This was a very nice HowTo, thanks!

I put a few minor comments. Two things that IMHO can improve the how-to:

- A paragraph about the required permission to run `analyze` on a table/database (see the sketch after this list).
- Pointing out cost-based vacuum delay. We have a pg_cron job that vacuums (and analyzes) our tables during midnight, when the load is low. But we found it rather intrusive, so we adjusted the parameters to diminish its effect. This is the flip side of this HowTo, which shows how to run it as fast as possible. (I saw your Tweet, and I'm wondering whether these parameters affect only vacuum, or analyze as well, since the documentation listed above mentions both: "During the execution of VACUUM and ANALYZE commands...")
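On the first point, a minimal sketch of the permissions involved, assuming PostgreSQL 17+ where the `MAINTAIN` privilege and the predefined `pg_maintain` role are available (on older versions, `ANALYZE` on a table generally requires the table owner, the database owner, or a superuser); the role and table names below are illustrative:

```sql
-- PostgreSQL 17+: grant the MAINTAIN privilege on a single table
-- (covers ANALYZE, VACUUM, and other maintenance commands)
grant maintain on table public.orders to analyze_runner;

-- or grant maintenance rights on all relations via the predefined role
grant pg_maintain to analyze_runner;
```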
Docs say it affects both, despite the name: https://postgres.ai/chats/01928625-70a6-78f6-a20c-17d74bcad95a (checked source code there too)
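A minimal sketch of the knobs in question (values are purely illustrative; for manual `VACUUM`/`ANALYZE` the delay is off by default, `vacuum_cost_delay = 0`, while autovacuum has its own `autovacuum_vacuum_cost_delay`):

```sql
-- inspect the current cost-based delay settings
select name, setting, unit
from pg_settings
where name in (
  'vacuum_cost_delay',
  'vacuum_cost_limit',
  'autovacuum_vacuum_cost_delay',
  'autovacuum_vacuum_cost_limit'
);

-- throttle a manual run in the current session only:
-- a non-zero delay makes VACUUM/ANALYZE sleep each time it has
-- accumulated vacuum_cost_limit "points" of page accesses
set vacuum_cost_delay = '2ms';
set vacuum_cost_limit = 200;
analyze verbose public.orders;  -- table name is illustrative
```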
Thanks!!
In our case, we have a super time-critical service with p99 < 5ms.
A normal vacuum of the whole DB takes only 20 mins, but during it p99 exceeds 1 second, which is not acceptable for us. After some fine-tuning, I could run it with p99 < 200 ms over the course of 2-3 hours, which is acceptable during midnight.
PS: We run a scheduled `vacuum (analyze, skip_locked, verbose);`. Without `vacuum` (only `analyze`), it would not be very intrusive and would run quite fast, as the DB is not super big (100-200 GB).
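A minimal sketch of how such a nightly job could be scheduled, assuming a reasonably recent pg_cron (job name and schedule are illustrative):

```sql
-- requires the pg_cron extension; runs every night at 00:30
select cron.schedule(
  'nightly-vacuum-analyze',
  '30 0 * * *',
  'vacuum (analyze, skip_locked, verbose);'
);
```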
I saw an interesting case today: we have a ~4 TB database with fewer than 20 tables. The largest tables are ~1.5 TB, 1 TB, and 0.5 TB. A nightly `analyze` (not `vacuum`, just `analyze`) is executed on the whole DB.

- This was previously run on RDS (32 GB RAM, 8 vCPUs) and took ~15 mins.
- We recently migrated to Aurora (64 GB RAM, 8 vCPUs) and it takes ~30-60 mins.
We are investigating ...
I'd be curious to learn what you find. I can also share this post I read recently, which shows classic RDS (with a large disk) being faster than Aurora for disk operations; the difference seems similar to what you observed.
We started analyzing table by table. Huge tables (> 1 TB) were quite fast and used very little memory (400 MB). But on one small table (3 GB), the analyze took long, memory grew pretty fast, and the DB server was killed due to OOM.
We used `vacuum` and `vacuum freeze` on the table; both ran without issue.

When running `vacuum full`, we saw the issue:

`ERROR: found xmin 183032816 from before relfrozenxid 514166874`
This means the table visibility map is corrupted.
This is a known issue: https://postgrespro.com/list/thread-id/2501007
A tool like `pg_surgery` offers `heap_force_freeze()`, which we can't use on Aurora.
I found another way to fix the visibility maps, which seems to work. But I'm surprised that running `analyze` on such a table can OOM-kill the whole server, as opposed to gracefully terminating the analyze process.
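For anyone running into a similar `relfrozenxid` error, a minimal sketch of a standard catalog query to find the relations with the oldest frozen XID (the `limit` is arbitrary):

```sql
-- relations with the highest XID age first; these are the first
-- candidates when investigating freezing / relfrozenxid problems
select
  c.oid::regclass as relation_name,
  c.relfrozenxid,
  age(c.relfrozenxid) as xid_age
from pg_class as c
where c.relkind in ('r', 'm', 't')  -- tables, matviews, TOAST
order by age(c.relfrozenxid) desc
limit 10;
```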