Skip to content

DISTINCT queries on p_ci_job_artifacts are not performant

Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.

Bug Summary

PostgreSQL optimizer fails to use existing indexes on partitioned table p_ci_job_artifacts for DISTINCT queries, resulting in full table scans of 40+ million rows. All indexes are created with the ONLY keyword on the parent table, making them unavailable to child partitions where actual data resides.

Steps to reproduce

  1. Run a DISTINCT query on the p_ci_job_artifacts table: SELECT DISTINCT project_id FROM p_ci_job_artifacts;
  2. Execute EXPLAIN ANALYZE to see the query plan: EXPLAIN (ANALYZE, BUFFERS) SELECT DISTINCT project_id FROM p_ci_job_artifacts;
  3. Observe that the query uses (Parallel) Seq Scan instead of available indexes
  4. Check existing indexes: SELECT * FROM pg_indexes WHERE tablename = 'p_ci_job_artifacts';
  5. I noted at all indexes show`CREATE INDEX ... ON ONLY public.p_ci_job_artifacts'. Does this mean that they are present only on the parent table, and thus not available on partitions?

What is the current bug behavior?

  • Query performs full sequential scan of 40+ million rows across 3 parallel workers. Execution time is ~4.8 seconds for retrieving 72,173 distinct values.
  • The issue compounds when using a Geo Secondary on Selective synchronization, which can generate several of these queries simultaneously.
  • These queries should use indexes instead of sequential scans?

What is the expected correct behavior?

  • Query should use one of the available project_id indexes, and perform an index scan or index-only scan
  • Minimal I/O (< 1000 blocks instead of 996,005)

Reproducibility

The performance impact occurs consistently when using Geo Selective synchronization, in combination with very large p_ci_job_artifacts tables.

Impact Assessment

Customers report databases running on 100% CPU cycles, and delayed Geo metrics updates.

Severity

P3 - This is a production performance issue, but affects only a specific use case on Geo replicas

Environment

  • Database: Issue first reported on Amazon RDS, but the query not using indexes is also seen on an omnibus deployment

Screenshots and/or Relevant logs

-- EXPLAIN output showing full table scan
EXPLAIN (ANALYZE, BUFFERS) SELECT DISTINCT "p_ci_job_artifacts"."project_id" FROM "p_ci_job_artifacts";

QUERY PLAN
Unique (cost=1293847.52..1293891.48 rows=4396 width=4) (actual time=4807.830..4832.049 rows=72173 loops=1)
 Buffers: shared hit=422556 read=573452, temp read=267 written=268
 -> Sort (cost=1293847.52..1293869.50 rows=8792 width=4) (actual time=4807.829..4818.622 rows=180936 loops=1)
     Sort Method: external merge Disk: 2136kB
     -> Gather (cost=1292348.40..1293271.55 rows=8792 width=4) (actual time=4753.504..4774.909 rows=180936 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         -> HashAggregate (cost=1291348.40..1291392.35 rows=4396 width=4) (actual time=4751.149..4760.372 rows=60312 loops=3)
             -> Parallel Append (cost=0.00..1249156.48 rows=16876766 width=4) (actual time=0.007..3231.578 rows=13403281 loops=3)
                 -> Parallel Seq Scan on ci_job_artifacts p_ci_job_artifacts_1 (cost=0.00..1164772.65 rows=16876765 width=4) 
                    (actual time=0.006..2227.397 rows=13403281 loops=3)
                    Buffers: shared hit=422553 read=573452
Execution Time: 4835.403 ms

The area of concern

# frozen_string_literal: true

module Geo
  module ReplicableCiArtifactable
    extend ActiveSupport::Concern

    class_methods do
      extend ::Gitlab::Utils::Override

      # @return [ActiveRecord::Relation<Ci::{PipelineArtifact|JobArtifact|SecureFile}>] scope
      #         observing selective sync settings of the given node
      override :selective_sync_scope
      def selective_sync_scope(node, **params)
        return all unless node.selective_sync?

        # The primary_key_in in replicables_for_current_secondary method is at most a range
        # of IDs with a maximum of 10_000 records between them. We can additionally reduce
        # the batch size to 1_000 just for pipeline artifacts and job artifacts if needed.
        replicables = params.fetch(:replicables, none)
        replicables_project_ids = replicables.distinct.pluck(:project_id) # rubocop:disable Database/AvoidUsingPluckWithoutLimit -- the query is already restricted to a range
        selective_projects_ids = ::Project.selective_sync_scope(node).id_in(replicables_project_ids).pluck_primary_key

        project_id_in(selective_projects_ids)
      end
    end
  end
end

It seems like this query can be reworked. Untested idea:

# frozen_string_literal: true

module Geo
  module ReplicableCiArtifactable
    extend ActiveSupport::Concern

    class_methods do
      extend ::Gitlab::Utils::Override

      # @param node [GeoNode] the secondary Geo site to sync to
      # @param params [Hash] has a replicables key. In a secondary Geo site, this key's value is
      #                      set to the relation: with_files_stored_locally
      # @return [ActiveRecord::Relation<Ci::{PipelineArtifact|JobArtifact|SecureFile}>] scope
      #         observing selective sync settings of the given node
      override :selective_sync_scope
      def selective_sync_scope(node, **params)
        return all unless node.selective_sync?

        replicables = params.fetch(:replicables, none)
        replicables.merge(project_id_in(::Project.selective_sync_scope(node)))
      end
    end
  end
end
Edited by 🤖 GitLab Bot 🤖