Skip to content

Optimize high frequency query: "count(*) from projects join project_mirror_data", >50% of total_time on the master (gitlab.com)

On GitLab.com, on Postgres master we observe that the following query takes ~50% of total_time according to pg_stat_statements (so ~50% of CPU time spent on the master server is related to this query):

SELECT
    COUNT(*)
FROM
    "projects"
    INNER JOIN project_mirror_data import_state ON import_state.project_id = projects.id
WHERE
    "projects"."mirror" = ?
    AND ("import_state"."status" NOT IN (?, ?))
    AND (import_state.next_execution_timestamp <= ?)
    AND (import_state.retry_count <= ?)

This query's profile according to pg_stat_statements:

# Calls ▼ Total time Rows shared_blks_hit shared_blks_read shared_blks_dirtied shared_blks_written blk_read_time blk_write_time kcache_reads kcache_writes kcache_user_time_ms kcache_system_time Query
1 192,469
30.36/sec
1.00/call
0.26%
15,234,607.23 ms
2.403s/sec
79ms/call
49.21%
192,469
30.36/sec
1.00/call
0.05%
7,534,894,752 blks
1.19M blks/sec
39.15K blks/call
64.63%
0 blks
0.00 blks/sec
0.00 blks/call
0.00%
2,033 blks
0.32 blks/sec
0.01 blks/call
0.04%
0 blks
0.00 blks/sec
0.00 blks/call
0.00%
0.00 ms
0s/sec
0s/call
0.00%
0.00 ms
0s/sec
0s/call
0.00%
0.00 bytes
0.00 bytes/sec
0.00 bytes/call
0.00%
0.00 bytes
0.00 bytes/sec
0.00 bytes/call
0.00%
0.00 ms
0s/sec
0s/call
0.00%
0.00 ms
0s/sec
0s/call
0.00%
SELECT COUNT(*) FROM "projects" INNER JOIN project_mirror_data import_state ON import_state.project_id = projects.id WHERE "projects"."mirror" = ? AND ("import_state"."status" NOT IN (?, ?)) AND (import_state.next_execution_timestamp <= ?) AND (import_state.retry_count <= ?)

(see the full Postgres checkup report for GitLab.com)

Might be related: https://gitlab.com/gitlab-org/gitlab-ce/issues/29218

Ideas:

  1. Basic idea (not really optimization, but reducing load on the master, allowing future scaling): move this SELECT to replicas to reduce impact on the master (might be not applicable, but still worth to consider);
  2. Reduce number of calls. The query looks like related to background scheduled jobs -- do we really need to execute this 30 times per second?
  3. Consider app-level caching (I suspect this idea will not work if it's really related to background jobs)
  4. Improve performance switching from JOIN with Nest Loop based on 2 Index scans to 1-table SELECT with a single Index Only Scan. To achieve this, we need to perform light denormalization: duplicate "projects"."mirror" flag in the project_mirror_data table (ideally, using a trigger), and get rid of projects in they query completely.
  5. Do we really need count(*) every time? I suppose that if count > 0, next step is to retrieve data itself. Isn't it better to read "next N" records avoiding count and proceeding to processing unless nothing is found?

(consider the latest item first).

Edited by Andreas Brandl