Skip to content

Approximate table counts based on TABLESAMPLE

Andreas Brandl requested to merge ab-approximate-counts into master

What does this MR do?

  1. Refactor the existing approximate count strategies towards a Strategy pattern
  2. Implement a approximate count strategy for PostgreSQL using TABLESAMPLE (behind feature flag)

The goal here is to fix the admin dashboard (which is currently the only user of approximate counts AFAIS).

For PostgreSQL, the order of strategies is:

  1. Reltuples
  2. Tablesample
  3. Exact count

The Tablesample strategy works like this:

  1. Use reltuples statistics for an estimate E (irrespective of how the statistic is)
  2. If E < EXACT_COUNT_THRESHOLD, perform an exact count
  3. Else count a sample of roughly TABLESAMPLE_ROW_TARGET rows and interpolate to estimate full relation size.

The idea is that in both cases, we control how many rows are being used to estimate the relation size. In the tablesample case, we use E to calculate the portion of the table to count (TABLESAMPLE is based on a percentage of the table).

The unsolved issue here is that in case the reltuples statistic is way off reality, we may end up trying to count a large portion of the table. For example, if reltuples claims the table has only one record, we would perform an exact count. If in reality the table has millions of records, we're out of luck. In that sense, this change is an optimization of the current situation where we are out-of-luck most of the time (due to statistics being too old).

Note that the tablesample strategy feature can be toggled with tablesample_counts.

What are the relevant issue numbers?

https://gitlab.com/gitlab-org/gitlab-ce/issues/54116

Does this MR meet the acceptance criteria?

Edited by Andreas Brandl

Merge request reports