Skip to content

Ensure primary key index is used for Geo::Replicable

Stan Hu requested to merge sh-geo-use-cte-for-local-files into master

What does this MR do and why?

In gitlab-com/gl-infra/production#7756 (closed), we found that for a large ci_job_artifacts table, the query planner can flip between using one of two indexes to find local files:

  1. ci_job_artifacts_pkey
  2. index_ci_job_artifacts_on_file_store

The first index is always faster because for a given primary ID, there is only one row and hence filtering by file_store = 1 is fast.

In the worst case on GitLab.com, the second index would have to sift through over 22,000 rows to find a matching artifact with the given primary ID.

Even if Geo is not enabled, save_verification_details is called with every model that is replicated hooks into save. To ensure that the query is always fast, use a materialized CTE to look up by primary key ID and filter from there.

Resolves #374701 (closed)

Resolves #374718 (closed)

SQL Queries

Before

When the query planner and statistics are in a good state, this query works just fine.

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/12244/commands/43526 shows:

SELECT 1 AS one FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."file_store" = 1 AND "ci_job_artifacts"."id" = 10 LIMIT 1
 Limit  (cost=0.58..3.60 rows=1 width=4) (actual time=0.088..0.088 rows=0 loops=1)
   Buffers: shared hit=7
   I/O Timings: read=0.000 write=0.000
   ->  Index Scan using ci_job_artifacts_pkey on public.ci_job_artifacts  (cost=0.58..3.60 rows=1 width=4) (actual time=0.086..0.086 rows=0 loops=1)
         Index Cond: (ci_job_artifacts.id = 10)
         Filter: (ci_job_artifacts.file_store = 1)
         Rows Removed by Filter: 0
         Buffers: shared hit=7
         I/O Timings: read=0.000 write=0.000
Time: 0.302 ms
  - planning: 0.196 ms
  - execution: 0.106 ms
    - I/O read: 0.000 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 7 (~56.00 KiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

When things go bad!

For a bad case, see gitlab-com/gl-infra/production#7756 (comment 1104365277):

gitlabhq_production=# explain ( analyze, buffers ) SELECT 1 AS one FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."file_store" = 1 AND "ci_job_artifacts"."id" = 3314157561 LIMIT 1 ;
                                                                          QUERY PLAN                                                                           
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.58..2.10 rows=1 width=4) (actual time=30.011..30.012 rows=0 loops=1)
   Buffers: shared hit=18243
   ->  Index Scan using index_ci_job_artifacts_on_file_store on ci_job_artifacts  (cost=0.58..2.10 rows=1 width=4) (actual time=30.010..30.010 rows=0 loops=1)
         Index Cond: (file_store = 1)
         Filter: (id = '3314157561'::bigint)
         Rows Removed by Filter: 22630
         Buffers: shared hit=18243
 Planning Time: 0.128 ms
 Execution Time: 30.028 ms
(9 rows)

18,243 buffers were read. This by itself isn't terrible, but since CI job traces get saved over 15,000 times a minute (https://log.gprd.gitlab.net/goto/75ee0f90-3ab9-11ed-8656-f5f2137823ba), this can add up!

After

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/12244/commands/43528:

WITH "verifiables" AS MATERIALIZED
  (SELECT "ci_job_artifacts"."project_id", "ci_job_artifacts"."file_type", "ci_job_artifacts"."size", "ci_job_artifacts"."created_at", "ci_job_artifacts"."updated_at", "ci_job_artifacts"."expire_at", "ci_job_artifacts"."file", "ci_job_artifacts"."file_store", "ci_job_artifacts"."file_sha256", "ci_job_artifacts"."file_format", "ci_job_artifacts"."file_location", "ci_job_artifacts"."id", "ci_job_artifacts"."job_id", "ci_job_artifacts"."locked", "ci_job_artifacts"."partition_id"
   FROM "ci_job_artifacts"
   WHERE "ci_job_artifacts"."id" = 10)
SELECT 1 AS one
FROM "verifiables" AS "ci_job_artifacts"
WHERE "ci_job_artifacts"."file_store" = 1
LIMIT 1
 Limit  (cost=3.59..3.62 rows=1 width=4) (actual time=0.055..0.056 rows=0 loops=1)
   Buffers: shared hit=7
   I/O Timings: read=0.000 write=0.000
   CTE verifiables
     ->  Index Scan using ci_job_artifacts_pkey on public.ci_job_artifacts ci_job_artifacts_1  (cost=0.58..3.59 rows=1 width=147) (actual time=0.052..0.053 rows=0 loops=1)
           Index Cond: (ci_job_artifacts_1.id = 10)
           Buffers: shared hit=7
           I/O Timings: read=0.000 write=0.000
   ->  CTE Scan on verifiables ci_job_artifacts  (cost=0.00..0.02 rows=1 width=4) (actual time=0.054..0.054 rows=0 loops=1)
         Filter: (ci_job_artifacts.file_store = 1)
         Rows Removed by Filter: 0
         Buffers: shared hit=7
         I/O Timings: read=0.000 write=0.000
Time: 0.894 ms
  - planning: 0.782 ms
  - execution: 0.112 ms
    - I/O read: 0.000 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 7 (~56.00 KiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Michael Kozono

Merge request reports