Count users query
A relatively common but unfortunately slow query has the form:
SELECT COUNT(*)
FROM "users"
WHERE ("users"."state" IN ('active'))
  AND ("users"."ghost" = 'f' OR "users"."ghost" IS NULL)
  AND ("users"."support_bot" = 'f' OR "users"."support_bot" IS NULL)
  AND (
    (id IN (SELECT "members"."user_id"
            FROM "members"
            WHERE "members"."source_type" = 'Project'
              AND "members"."type" IN ('ProjectMember')
              AND "members"."source_id" = ?
              AND "members"."source_type" = 'Project'
              AND "members"."type" IN ('ProjectMember')
              AND "members"."requested_at" IS NULL
              AND (access_level > ?))
     OR id IN (?, ?, ?)
     OR id IN (SELECT "members"."user_id"
               FROM "members"
               WHERE "members"."source_type" = 'Namespace'
                 AND "members"."type" IN ('GroupMember')
                 AND "members"."source_id" = ?
                 AND "members"."source_type" = 'Namespace'
                 AND "members"."type" IN ('GroupMember')
                 AND "members"."requested_at" IS NULL
                 AND (access_level > ?)))
    AND id NOT IN (SELECT "approvals"."user_id"
                   FROM "approvals"
                   WHERE "approvals"."merge_request_id" = ?))
  AND ("users"."id" != ?);
Part of the difficulty with this query is that PostgreSQL has trouble estimating, from the statistics on the users table, how many rows will be returned. It therefore assumes that most of the users table is involved and decides that a sequential scan is the best plan. This can be seen in the EXPLAIN output for the query, where the top seq scan node shows an estimate of rows=544012.
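To make the planner's trade-off concrete, here is a deliberately simplified cost comparison. This is not PostgreSQL's actual cost model; the table size and per-page costs below are made-up illustrative values, and only the rows=544012 estimate comes from the plan above.

```python
# Toy illustration of why an inflated row estimate pushes a planner
# toward a sequential scan. NOT PostgreSQL's real cost model; the
# constants and table size are assumptions for illustration only.

SEQ_PAGE_COST = 1.0      # cost to read one page sequentially
RANDOM_PAGE_COST = 4.0   # cost of a random (index-driven) page fetch
TABLE_PAGES = 50_000     # assumed size of the users table, in pages

def seq_scan_cost(table_pages):
    # A sequential scan always reads the whole table.
    return table_pages * SEQ_PAGE_COST

def index_scan_cost(estimated_rows):
    # Crude model: roughly one random page fetch per expected row.
    return estimated_rows * RANDOM_PAGE_COST

# With the inflated estimate (rows=544012), the index scan looks far
# more expensive than simply scanning the whole table:
assert index_scan_cost(544_012) > seq_scan_cost(TABLE_PAGES)

# With a realistic estimate (only a handful of matching users), the
# index scan is clearly cheaper -- the plan the rewrite unlocks:
assert index_scan_cost(50) < seq_scan_cost(TABLE_PAGES)

print("seq scan cost:", seq_scan_cost(TABLE_PAGES))
print("index scan cost at rows=544012:", index_scan_cost(544_012))
print("index scan cost at rows=50:", index_scan_cost(50))
```

The point is not the specific numbers but the shape of the decision: once the estimated row count is wrong by several orders of magnitude, the planner's choice of a sequential scan is entirely rational given the inputs it has.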
In reality, very few tuples are involved, so we can make the query extremely fast by reworking it a bit. Instead of using IN clauses to build lists of ids to compare the current id against, we can pull all relevant ids from a single UNION ALL sub-query, then filter those ids, and finally count the distinct values, as in this query:
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT count(DISTINCT id)
FROM (
  SELECT users.id
  FROM users
  WHERE ("users"."state" IN ('active'))
    AND ("users"."ghost" = 'f' OR "users"."ghost" IS NULL)
    AND ("users"."support_bot" = 'f' OR "users"."support_bot" IS NULL)
    AND (id IN (?, ?, ?))
  UNION ALL
  SELECT users.id
  FROM users
  JOIN members ON (users.id = members.user_id)
  WHERE ("users"."state" IN ('active'))
    AND ("users"."ghost" = 'f' OR "users"."ghost" IS NULL)
    AND ("users"."support_bot" = 'f' OR "users"."support_bot" IS NULL)
    AND "members"."source_type" = 'Project'
    AND "members"."type" IN ('ProjectMember')
    AND "members"."source_id" = ?
    AND "members"."requested_at" IS NULL
    AND (access_level > ?)
  UNION ALL
  SELECT users.id
  FROM users
  JOIN members ON (users.id = members.user_id)
  WHERE ("users"."state" IN ('active'))
    AND ("users"."ghost" = 'f' OR "users"."ghost" IS NULL)
    AND ("users"."support_bot" = 'f' OR "users"."support_bot" IS NULL)
    AND "members"."source_type" = 'Namespace'
    AND "members"."type" IN ('GroupMember')
    AND "members"."source_id" = ?
    AND "members"."requested_at" IS NULL
    AND (access_level > ?)
) AS a
WHERE id != ?
  AND id NOT IN (
    SELECT "approvals"."user_id"
    FROM "approvals"
    WHERE "approvals"."merge_request_id" = ?);
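The two query shapes should count exactly the same users. The sketch below checks that on a tiny made-up dataset using SQLite (so it verifies the equivalence of the SQL shape only, not PostgreSQL's planning behavior); the schema is pared down, the boolean columns use 0/NULL instead of PostgreSQL's 'f'/NULL, and the parameter placeholders are replaced with illustrative literal values.

```python
# Check that the UNION ALL rewrite counts the same users as the
# original IN/OR query shape. Toy schema and data; SQLite, not
# PostgreSQL, so this demonstrates result equivalence only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE users (id INTEGER PRIMARY KEY, state TEXT, ghost INT, support_bot INT);
  CREATE TABLE members (user_id INT, source_type TEXT, type TEXT,
                        source_id INT, requested_at TEXT, access_level INT);
  CREATE TABLE approvals (user_id INT, merge_request_id INT);
  INSERT INTO users VALUES (1,'active',NULL,NULL),(2,'active',NULL,NULL),
                           (3,'blocked',NULL,NULL),(4,'active',NULL,NULL),
                           (5,'active',1,NULL);
  INSERT INTO members VALUES (1,'Project','ProjectMember',10,NULL,30),
                             (2,'Namespace','GroupMember',20,NULL,30),
                             (3,'Project','ProjectMember',10,NULL,30),
                             (5,'Namespace','GroupMember',20,NULL,30);
  INSERT INTO approvals VALUES (4, 99);
""")

original = """
  SELECT COUNT(*) FROM users
  WHERE state = 'active'
    AND (ghost IS NULL OR ghost = 0)
    AND (support_bot IS NULL OR support_bot = 0)
    AND ((id IN (SELECT user_id FROM members
                 WHERE source_type='Project' AND type='ProjectMember'
                   AND source_id=10 AND requested_at IS NULL AND access_level>10)
          OR id IN (4)
          OR id IN (SELECT user_id FROM members
                    WHERE source_type='Namespace' AND type='GroupMember'
                      AND source_id=20 AND requested_at IS NULL AND access_level>10))
         AND id NOT IN (SELECT user_id FROM approvals WHERE merge_request_id=99))
    AND id != 2
"""

rewrite = """
  SELECT COUNT(DISTINCT id) FROM (
    SELECT id FROM users
    WHERE state='active' AND (ghost IS NULL OR ghost=0)
      AND (support_bot IS NULL OR support_bot=0) AND id IN (4)
    UNION ALL
    SELECT users.id FROM users JOIN members ON users.id=members.user_id
    WHERE state='active' AND (ghost IS NULL OR ghost=0)
      AND (support_bot IS NULL OR support_bot=0)
      AND source_type='Project' AND type='ProjectMember'
      AND source_id=10 AND requested_at IS NULL AND access_level>10
    UNION ALL
    SELECT users.id FROM users JOIN members ON users.id=members.user_id
    WHERE state='active' AND (ghost IS NULL OR ghost=0)
      AND (support_bot IS NULL OR support_bot=0)
      AND source_type='Namespace' AND type='GroupMember'
      AND source_id=20 AND requested_at IS NULL AND access_level>10
  ) AS a
  WHERE id != 2
    AND id NOT IN (SELECT user_id FROM approvals WHERE merge_request_id=99)
"""

a = conn.execute(original).fetchone()[0]
b = conn.execute(rewrite).fetchone()[0]
assert a == b
print("both queries count", a, "user(s)")
```

Note that the UNION ALL branches can produce the same id more than once (a user can match several branches), which is why the outer count must be count(DISTINCT id) rather than count(*).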
Using this approach, in at least one example case, we were able to reduce the query time from ~1.3s to ~0.5ms (though roughly 1ms is spent on planning). This is largely because the estimated row counts are much more in line with reality: each branch of the subquery is a simple AND chain of filters, so PostgreSQL can reason that only a few rows will satisfy the conditions. Based on that, it uses indexes on the users table to pull out just the few records the query needs.