Optimize high frequency query: "count(*) from projects join project_mirror_data", >50% of total_time on the master (gitlab.com)
On GitLab.com, on Postgres master we observe that the following query takes ~50% of total_time
according to pg_stat_statements (so ~50% of CPU time spent on the master server is related to this query):
SELECT
COUNT(*)
FROM
"projects"
INNER JOIN project_mirror_data import_state ON import_state.project_id = projects.id
WHERE
"projects"."mirror" = ?
AND ("import_state"."status" NOT IN (?, ?))
AND (import_state.next_execution_timestamp <= ?)
AND (import_state.retry_count <= ?)
This query's profile according to pg_stat_statements:
# | Calls | ▼ Total time | Rows | shared_blks_hit | shared_blks_read | shared_blks_dirtied | shared_blks_written | blk_read_time | blk_write_time | kcache_reads | kcache_writes | kcache_user_time_ms | kcache_system_time | Query |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 192,469 30.36/sec 1.00/call 0.26% |
15,234,607.23 ms 2.403s/sec 79ms/call 49.21% |
192,469 30.36/sec 1.00/call 0.05% |
7,534,894,752 blks 1.19M blks/sec 39.15K blks/call 64.63% |
0 blks 0.00 blks/sec 0.00 blks/call 0.00% |
2,033 blks 0.32 blks/sec 0.01 blks/call 0.04% |
0 blks 0.00 blks/sec 0.00 blks/call 0.00% |
0.00 ms 0s/sec 0s/call 0.00% |
0.00 ms 0s/sec 0s/call 0.00% |
0.00 bytes 0.00 bytes/sec 0.00 bytes/call 0.00% |
0.00 bytes 0.00 bytes/sec 0.00 bytes/call 0.00% |
0.00 ms 0s/sec 0s/call 0.00% |
0.00 ms 0s/sec 0s/call 0.00% |
SELECT COUNT(*) FROM "projects" INNER JOIN project_mirror_data import_state ON import_state.project_id = projects.id WHERE "projects"."mirror" = ? AND ("import_state"."status" NOT IN (?, ?)) AND (import_state.next_execution_timestamp <= ?) AND (import_state.retry_count <= ?) |
(see the full Postgres checkup report for GitLab.com)
Might be related: https://gitlab.com/gitlab-org/gitlab-ce/issues/29218
Ideas:
- Basic idea (not really optimization, but reducing load on the master, allowing future scaling): move this SELECT to replicas to reduce impact on the master (might be not applicable, but still worth to consider);
- Reduce number of calls. The query looks like related to background scheduled jobs -- do we really need to execute this 30 times per second?
- Consider app-level caching (I suspect this idea will not work if it's really related to background jobs)
- Improve performance switching from JOIN with Nest Loop based on 2 Index scans to 1-table SELECT with a single Index Only Scan. To achieve this, we need to perform light denormalization: duplicate
"projects"."mirror"
flag in theproject_mirror_data
table (ideally, using a trigger), and get rid ofprojects
in they query completely. - Do we really need
count(*)
every time? I suppose that if count > 0, next step is to retrieve data itself. Isn't it better to read "next N" records avoiding count and proceeding to processing unless nothing is found?
(consider the latest item first).
Edited by Andreas Brandl