Add int4 saturation metrics for postgres
Ref: gitlab-org/gitlab#399485 (closed)
I was able to use @stomlinson's awesome query in an almost unmodified form. Simon expressed some concern that taking the highest value in `pg_stats.histogram_bounds` might be an incorrect assumption, so I used an alternate method of getting max IDs to gather data on how much the actual max IDs differ from the values reported in `pg_stats.histogram_bounds`. Looking up each sequence and reading its last value is fairly cheap, but it requires a stored procedure.
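For reference, the `histogram_bounds` approach reads the last bucket boundary straight out of `pg_stats`. A minimal standalone example might look like the following (the `projects`/`id` names are placeholders for illustration, not part of the actual query):

```sql
-- Approximate max ID for a single column via planner statistics.
-- 'projects' and 'id' are placeholder names.
SELECT (histogram_bounds::text::int8[])[array_upper(histogram_bounds, 1)] AS approx_max_id
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'projects'
  AND attname = 'id';
```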
<details><summary>Click to expand stored procedure</summary>

```sql
CREATE OR REPLACE FUNCTION auto_increment_saturation() RETURNS TABLE (
  table_name TEXT,
  column_name TEXT,
  column_type TEXT,
  sequence_value BIGINT,
  row_saturation REAL
) AS $$
DECLARE
  row record;
  colmax bigint := 2147483647;
BEGIN
  FOR row IN SELECT
    substring(a.column_default, '''([a-z0-9_]+)''') AS sequence,
    a.table_name,
    a.column_name,
    a.data_type,
    a.table_schema
  FROM
    information_schema.columns a
  WHERE
    column_default LIKE 'nextval%'
    AND data_type = 'integer'
  LOOP
    table_name := row.table_name;
    column_name := row.column_name;
    column_type := row.data_type;
    -- RAISE NOTICE '%', row.sequence;
    EXECUTE 'SELECT last_value FROM ' || quote_ident(row.table_schema) || '.' || quote_ident(row.sequence) INTO sequence_value;
    row_saturation := sequence_value::real / colmax::real;
    RETURN NEXT;
  END LOOP;
  RETURN;
END;
$$ LANGUAGE plpgsql;
```

</details>
Then, I used this query to compare against the values in `pg_stats.histogram_bounds`.
<details><summary>Click to expand query</summary>

```sql
WITH saturations AS (
  SELECT *
  FROM auto_increment_saturation()
),
max_values AS (
  SELECT
    (histogram_bounds::text::int8[])[array_upper(histogram_bounds, 1)] AS largest_value,
    tablename,
    attname
  FROM pg_stats
)
SELECT
  largest_value,
  table_name,
  row_saturation,
  sequence_value,
  ABS(largest_value - sequence_value) AS delta
FROM saturations
INNER JOIN max_values ON saturations.table_name = max_values.tablename
  AND saturations.column_name = max_values.attname
ORDER BY delta DESC
LIMIT 40;
```

</details>
And the results look good to me:
```
 largest_value | table_name                           | row_saturation | sequence_value | delta
---------------+--------------------------------------+----------------+----------------+---------
    1004652501 | system_note_metadata                 |     0.46864215 |     1006401331 | 1748830
     476987413 | ci_pipeline_variables                |     0.22244309 |      477692881 |  705468
     490584119 | deployments                          |     0.22873919 |      491213667 |  629548
     577976208 | ci_pipelines                         |      0.2693658 |      578458615 |  482407
     321355514 | todos                                |     0.14985073 |      321801998 |  446484
      92109814 | merge_request_assignees              |    0.043023333 |       92391905 |  282091
      64250201 | approvals                            |       0.030025 |       64478197 |  227996
     257619700 | label_links                          |     0.12006791 |      257843866 |  224166
     265752024 | lfs_objects_projects                 |      0.1238513 |      265968644 |  216620
      98551386 | oauth_access_tokens                  |    0.045984253 |       98750434 |  199048
      67890189 | note_diff_files                      |    0.031699948 |       68075116 |  184927
      80381642 | uploads                              |    0.037486024 |       80500626 |  118984
      25015845 | ci_sources_pipelines                 |    0.011693623 |       25111865 |   96020
      58424982 | gpg_signatures                       |    0.027248267 |       58515209 |   90227
      61765195 | notification_settings                |     0.02880333 |       61854678 |   89483
      46927462 | project_features                     |    0.021890186 |       47008818 |   81356
      77622223 | members                              |     0.03618322 |       77702870 |   80647
      55115349 | user_agent_details                   |    0.025695788 |       55181283 |   65934
```
There seems to be a fairly linear margin of error of roughly 0.1% relative to the current largest value, so I believe we can rely on this method for a safe, rough estimate of the largest ID in each column we track. Of course, my comparison only covers columns with a very uniform distribution of values, so there could be columns where `histogram_bounds` has a higher margin of error; however, analyzing the output of Simon's query shows that all of the selected columns are ID columns anyway. I am confident this method will give us useful metrics.
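As a quick sanity check on that margin, dividing each `delta` by its `largest_value` for the top rows of the results table confirms the relative error stays well under 0.2%:

```python
# Relative error (delta / largest_value) for a few rows taken
# directly from the results table above.
rows = [
    ("system_note_metadata", 1004652501, 1748830),
    ("ci_pipeline_variables", 476987413, 705468),
    ("deployments", 490584119, 629548),
    ("ci_pipelines", 577976208, 482407),
    ("todos", 321355514, 446484),
]

for table, largest_value, delta in rows:
    rel_err = delta / largest_value
    print(f"{table:25s} {rel_err:.4%}")
```

The relative errors come out between roughly 0.08% and 0.18%, consistent with the "~0.1%" estimate.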
### Note on performance
This is not the most performant query. Two factors affect this:

- One of the heaviest operations is a join that, as far as I know, we have no control over (an intermediate join against `pg_attribute` when we join against `pg_stats`).
- We have to read every `histogram_bounds` field to grab its last member for every int4 column in the DB.

There are ways to solve these problems, but for simplicity we should try to stick to a single query. Although it is on the heavier side (1.5–2s), we only have to run it occasionally.
@reprazent I was wondering if you could help with the following:

- Is there a way we can run this less frequently?
- We don't need to run it on every single DB system; can we isolate it to run on only one replica?

Also, I am presenting the column name, the current value, and the column max value. Will this work for our needs when generating a percentage downstream?
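For what it's worth, here is a minimal sketch of how a downstream consumer could turn those three fields into a percentage (the function name is hypothetical, not part of the actual exporter):

```python
# Hypothetical sketch: derive a saturation percentage from
# (current_value, column_max), as a downstream consumer might.
INT4_MAX = 2**31 - 1  # 2147483647, the max of a signed int4


def saturation_pct(current_value: int, column_max: int = INT4_MAX) -> float:
    """Return saturation as a percentage of the column's max value."""
    return 100.0 * current_value / column_max


# Example using system_note_metadata's sequence value from the table above.
print(f"{saturation_pct(1006401331):.2f}%")  # ≈ 46.86%, matching row_saturation
```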