Skip to content

Add int4 saturation metrics for postgres

Jon Jenkins requested to merge (removed):int-column-saturation into master

Ref: gitlab-org/gitlab#399485 (closed)

I was able to use @stomlinson's awesome query in an almost unmodified form. Simon expressed a bit of concern that using the highest value in pg_stats.histogram_bounds might be an incorrect assumption, so I used an alternate method of getting max IDs to gather some data about how the actual max IDs might differ from the values presented in pg_stats.histogram_bounds. It is fairly cheap to look up sequences and get the last value from them, but this requires a stored procedure.

Click to expand stored procedure
CREATE OR REPLACE FUNCTION auto_increment_saturation() RETURNS TABLE (
    table_name TEXT,
    column_name TEXT,
    column_type TEXT,
    sequence_value BIGINT,
    row_saturation REAL
) AS $$
DECLARE
  row record;
  colmax bigint := 2147483647;
BEGIN
    FOR row IN SELECT
            substring(a.column_default, '''([a-z0-9_]+)''') AS sequence,
            a.table_name,
            a.column_name,
            a.data_type,
            a.table_schema
        FROM
            information_schema.columns a
        WHERE
            column_default LIKE 'nextval%'
            AND data_type = 'integer'
    LOOP
        table_name := row.table_name;
        column_name := row.column_name;
        column_type := row.data_type;
        -- RAISE NOTICE '%', row.sequence;
        EXECUTE 'SELECT last_value FROM ' || quote_ident(row.table_schema) || '.' || quote_ident(row.sequence) INTO sequence_value;
        row_saturation := sequence_value::real / colmax::real;
        RETURN NEXT;
        END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

Then, I used this query to compare against the values in pg_stats.histogram_bounds.

Click to expand query
WITH saturations AS (
    SELECT
        *
    FROM
        auto_increment_saturation ()
),
max_values AS (
    SELECT
        (histogram_bounds::text::int8[])[array_upper(histogram_bounds, 1)] AS largest_value,
        tablename,
        attname
    FROM
        pg_stats
)
SELECT
    largest_value,
    table_name,
    row_saturation,
    sequence_value,
    ABS(largest_value - sequence_value) AS delta
FROM
    saturations
    INNER JOIN max_values ON saturations.table_name = max_values.tablename
        AND saturations.column_name = max_values.attname
    ORDER BY
        delta DESC
    LIMIT 40;

And the results look good to me:

 largest_value |              table_name              | row_saturation | sequence_value |  delta  
---------------+--------------------------------------+----------------+----------------+---------
    1004652501 | system_note_metadata                 |     0.46864215 |     1006401331 | 1748830
     476987413 | ci_pipeline_variables                |     0.22244309 |      477692881 |  705468
     490584119 | deployments                          |     0.22873919 |      491213667 |  629548
     577976208 | ci_pipelines                         |      0.2693658 |      578458615 |  482407
     321355514 | todos                                |     0.14985073 |      321801998 |  446484
      92109814 | merge_request_assignees              |    0.043023333 |       92391905 |  282091
      64250201 | approvals                            |       0.030025 |       64478197 |  227996
     257619700 | label_links                          |     0.12006791 |      257843866 |  224166
     265752024 | lfs_objects_projects                 |      0.1238513 |      265968644 |  216620
      98551386 | oauth_access_tokens                  |    0.045984253 |       98750434 |  199048
      67890189 | note_diff_files                      |    0.031699948 |       68075116 |  184927
      80381642 | uploads                              |    0.037486024 |       80500626 |  118984
      25015845 | ci_sources_pipelines                 |    0.011693623 |       25111865 |   96020
      58424982 | gpg_signatures                       |    0.027248267 |       58515209 |   90227
      61765195 | notification_settings                |     0.02880333 |       61854678 |   89483
      46927462 | project_features                     |    0.021890186 |       47008818 |   81356
      77622223 | members                              |     0.03618322 |       77702870 |   80647
      55115349 | user_agent_details                   |    0.025695788 |       55181283 |   65934

There seems to be a fairly linear margin of error of about ~0.1% commensurate to the current largest value, so I believe that we can rely on this method to safely assess a rough estimate of the largest ID in each column that we track. Of course, my method only compares columns where there is a very uniform distribution of values so I believe that there could be columns where histogram_bounds has a higher margin for error, but upon analyzing the output of Simon's query, it looks like all of the columns selected are ID columns anyway. I am confident in this method to give us useful metrics.

Note on performance

This is not the most performant query. There are two factors that affect this:

  1. One of the heaviest operations is due to a join that takes place which AFAIK we have no control over (an intermediate join when we join pg_stats against pg_attribute).
  2. We have to read every single histogram_bounds field to grab the last member for every int4 in the DB.

There are ways to solve these problems, but for simplicity we should try to stick to a single query. Although it is on the heavier side (1.5-2s), we only have to run it occasionally.

@reprazent I was wondering if you could help with the following:

  1. Is there a way we can run this less frequently?
  2. We don't need to run on every single DB system, can we isolate it to only run on one replica?

Also, I am presenting the column name, the current value, and the column max value. Will this work for our needs in generating a percentage downstream?

Merge request reports