Collect 15-30min of Postgres logs with log_min_duration_statement = 0 for analysis and DB experiments
These actions are part of issue infrastructure#4921
As discussed with @abrandl and @dawsmith, I'm going to collect a sample (15-30 minutes) of the all-queries log on Wednesday, Sep 19, starting at 12:00 UTC (the busiest time is ~13:00 UTC).
Brief description of the actions
To do that, I'll set log_min_duration_statement with a manual alter system set .. command on the master, followed by select pg_reload_conf();. The plan (see details below) is to reach 0 iteratively, descending from the current value of 1 second in steps ~15 minutes apart, while controlling IO and the log size.
Projected IO impact
The projected IO impact, based on the pg_stat_statements view (it lacks query parameters, but this only adds a small error to the forecast – there are no huge Geo queries on the master node) and on the last 10 hours of observation:
gitlabhq_production=# select now(), :'TS_PGSS_RESET' as pg_stat_statements_last_reset, now() - :'TS_PGSS_RESET' since_reset, sum(calls * length(query)) as total_bytes, sum(calls * length(query)) / extract(epoch from now() - :'TS_PGSS_RESET') as bytes_per_sec from pg_stat_statements;
-[ RECORD 1 ]-----------------+------------------------------
now | 2018-09-18 07:51:16.052095+00
pg_stat_statements_last_reset | 2018-09-17 21:19:59.557029+00
since_reset | 10:31:16.495066
total_bytes | 37407827046
bytes_per_sec | 987626.41529573
– this gives an estimate of ~1MB/s for writing the Postgres log. It is based on non-peak hours though, so it will be double-checked during Tuesday mid-day. This is an estimate for log_min_duration_statement = 0; for higher thresholds, including the lowest possible positive value of 1ms, the expected IO will be much lower.
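As a quick arithmetic sanity check of the bytes_per_sec figure above (the 10:31:16 interval converted to seconds by hand; integer division, so the result differs slightly from the fractional-seconds value in the psql output):

```shell
# since_reset = 10:31:16 ≈ 10*3600 + 31*60 + 16 = 37876 seconds
seconds=$(( 10*3600 + 31*60 + 16 ))
echo $(( 37407827046 / seconds ))   # 987639 bytes/s, i.e. roughly 1 MB/s
```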
Additional observation: a lot of queries under 1ms
If we take the Top-15 queries (by total_time) among those whose max_time is < 1ms (pg_stat_statements stores times in milliseconds), we'll see that they take >50% of the time:
select
total_time,
sum(total_time) over () as total_time_registered_in_pss,
round(100::real * (sum(total_time) over (order by total_time desc)) / sum(total_time) over ())::text
|| '%' as accum_percentage,
min_time,
round(mean_time::numeric, 2) as mean_time,
max_time,
query
from pg_stat_statements
where max_time < 1
order by total_time desc
limit 15;
The detailed plan of actions
0) Tuesday, 2018-09-18 12:00 UTC: reset pg_stat_statements and double check the write IO estimate. In psql on Postgres master:
select now() as "TS_PGSS_RESET" from pg_stat_statements_reset() \gset
select
now(),
:'TS_PGSS_RESET' as pg_stat_statements_last_reset,
now() - :'TS_PGSS_RESET' since_reset,
sum(calls * length(query)) as total_bytes,
sum(calls * length(query)) / extract(epoch from now() - :'TS_PGSS_RESET') as bytes_per_sec
from pg_stat_statements;
\watch 300
1) Wednesday, 2018-09-19 11:xx UTC: preparations.
Set millisecond precision in log timestamps – use %m instead of %t (currently we have second-level precision, %t):
alter system set log_line_prefix = '%m [%p]: [%l-1] db=%d,user=%u ';
select pg_reload_conf();
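For illustration, with the new prefix a log line should look roughly like this (the surrounding fields are an assumption based on the prefix format, not a captured line); %m differs from %t only in the millisecond part of the timestamp:

```
2018-09-19 12:00:01.123 UTC [12345]: [1-1] db=gitlabhq_production,user=gitlab LOG:  duration: 15.432 ms ...
```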
2) Wednesday, 2018-09-19 12:00 UTC: start descending.
In psql on Postgres master:
alter system set log_min_duration_statement = '100ms';
select pg_reload_conf();
3) Observe the current size of the log and avg speed of its growth:
sudo ls -lah /var/log/gitlab/postgresql/current
# this will print avg bytes/sec based on the full "current" log file
sudo head -n1 /var/log/gitlab/postgresql/current | head -c 22 | sed 's/_/T/g' | xargs date +"%s" --date \
| awk \
-v date="$(date +"%s")" \
'{print (date - $1)}' \
| awk \
-v bytes="$(sudo wc -c /var/log/gitlab/postgresql/current | awk '{print $1}')" \
'{print (bytes / $1)}'
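The same measurement can be wrapped in a small shell helper (the name log_growth_rate is ours, not part of any existing tooling); on the DB host it would be pointed at /var/log/gitlab/postgresql/current, possibly under sudo:

```shell
# Average bytes/sec appended to a file over a sampling window (default 60s).
log_growth_rate() {
  local file=$1 window=${2:-60}
  local before after
  before=$(wc -c < "$file")
  sleep "$window"
  after=$(wc -c < "$file")
  echo $(( (after - before) / window ))
}
```

A file nobody is writing to reports 0 bytes/sec; run it against the live log after each threshold change.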
And check IO with sudo iotop -o and monitoring (https://dashboards.gitlab.net/d/pEfSMUhmz/postgresql-disk-io?orgId=1 + GCP graphs).
4) Wednesday, 2018-09-19 12:15 UTC: continue descending, set to '10ms'
In psql on Postgres master:
alter system set log_min_duration_statement = '10ms';
select pg_reload_conf();
Continue observing as described in step 3.
5) Wednesday, 2018-09-19 12:30 UTC: continue descending, set to '5ms'
In psql on Postgres master:
alter system set log_min_duration_statement = '5ms';
select pg_reload_conf();
Continue observing as described in step 3, and check IO with sudo iotop -o and monitoring (https://dashboards.gitlab.net/d/pEfSMUhmz/postgresql-disk-io?orgId=1 + GCP graphs).
6) Wednesday, 2018-09-19 12:45 UTC: continue descending, set to '1ms'
In psql on Postgres master:
alter system set log_min_duration_statement = '1ms';
select pg_reload_conf();
Continue observing as described in step 3, and check IO with sudo iotop -o and monitoring (https://dashboards.gitlab.net/d/pEfSMUhmz/postgresql-disk-io?orgId=1 + GCP graphs).
7) DECISION TO BE MADE BASED ON OBSERVED NUMBERS: if the IO caused by logging is > 1MB/s at this point, stop at 1ms and collect only partial logs with this threshold. If it's below, continue with the last step of descending, to collect the all-queries log:
alter system set log_min_duration_statement = 0;
select pg_reload_conf();
8) Collect logs (minimum 15 minutes, max 30 minutes):
sudo cp /var/log/gitlab/postgresql/current ~nikolays/postgres_0_20180919.log
9) Return to the initial state - in psql on the master node:
alter system reset log_min_duration_statement;
alter system reset log_line_prefix;
select pg_reload_conf();
\c
show log_min_duration_statement;
show log_line_prefix;
Notes on possible risks
No performance degradation is expected. In case of any unpredicted IO impact, steps 8-9 (collect whatever has been logged and return to the initial state) will be applied immediately.
It is expected that ~2-4GB of logs will be generated during the collection phase (with a projected generation speed of ~1-2MB/s), which is equivalent to ~10 days of logging with the current log_min_duration_statement threshold (1s).
This might cause some delays in Kibana processing.
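As an order-of-magnitude check of the volume estimate (using the projected 1-2MB/s rate and the 15-30 minute collection window from step 8):

```shell
echo "$(( 1 * 60 * 15 )) MB"   # 900 MB   – lower bound: 1 MB/s for 15 min
echo "$(( 2 * 60 * 30 )) MB"   # 3600 MB  – upper bound: 2 MB/s for 30 min
```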
TODO after this work is done
After this is complete, TODO (separate issues):
- process, analyze logs with pgBadger
- PoC usage of logs with pgreplay on experimental DB nodes
- set log_min_duration_statement to 100ms permanently
- set a better log_line_prefix – stop using "one for all" DB roles; use separate ones for different apps and for administration
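For the pgBadger item, a possible invocation to start from (a sketch: it assumes pgBadger is installed, the --prefix value matches the log_line_prefix set in step 1, and the file name is the one from step 8):

```shell
pgbadger --prefix '%m [%p]: [%l-1] db=%d,user=%u ' \
         --outfile postgres_0_20180919.html \
         postgres_0_20180919.log
```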