Geo: Convert to using PostgreSQL CTEs for project finders
From https://gitlab.com/gitlab-org/gitlab-ee/issues/5731#note_70179122, I think using PostgreSQL CTEs for many of our FDW queries would significantly reduce the amount of data we need to transfer from the gitlab_secondary
database:
With CTE
WITH atable AS (
SELECT "gitlab_secondary"."projects"."id" AS id, "gitlab_secondary"."projects".repository_storage, gitlab_secondary.projects.last_repository_updated_at FROM "gitlab_secondary"."projects" WHERE id > 4000000
)
SELECT atable.id FROM atable LEFT OUTER JOIN project_registry ON project_registry.project_id = atable.id WHERE "project_registry"."project_id" IS NULL AND atable.repository_storage = 'nfs-file04' AND (1=1) ORDER BY atable.last_repository_updated_at DESC LIMIT 1000;
Limit (cost=355458.87..355461.37 rows=1000 width=12) (actual time=17480.613..17480.613 rows=0 loops=1) WHERE "project_registry"."project_id" IS NULL AND atable.repository_storage = 'nfs-file04' AND (1=1) ORDER Output: atable.id, atable.last_repository_updated_at
CTE atable
-> Foreign Scan on gitlab_secondary.projects (cost=100.00..293914.72 rows=1849965 width=44) (actual time=4.281..16883.684 rows=1749619 loops=1)
Output: projects.id, projects.repository_storage, projects.last_repository_updated_at
Remote SQL: SELECT id, repository_storage, last_repository_updated_at FROM public.projects WHERE ((id > 4000000))
-> Sort (cost=61544.15..61555.71 rows=4625 width=12) (actual time=17480.613..17480.613 rows=0 loops=1)
Output: atable.id, atable.last_repository_updated_at
Sort Key: atable.last_repository_updated_at DESC
Sort Method: quicksort Memory: 25kB
-> Nested Loop Anti Join (cost=0.43..61290.57 rows=4625 width=12) (actual time=17480.607..17480.607 rows=0 loops=1)
Output: atable.id, atable.last_repository_updated_at
-> CTE Scan on atable (cost=0.00..41624.21 rows=9250 width=12) (actual time=2388.199..17480.552 rows=2 loops=1)
Output: atable.id, atable.repository_storage, atable.last_repository_updated_at
Filter: ((atable.repository_storage)::text = 'nfs-file04'::text)
Rows Removed by Filter: 1749617
-> Index Only Scan using index_project_registry_on_project_id on public.project_registry (cost=0.43..2.46 rows=1 width=4) (actual time=0.021..0.021 rows=1 loops=2)
Output: project_registry.project_id
Index Cond: (project_registry.project_id = atable.id)
Heap Fetches: 2
Planning time: 1.643 ms
Execution time: 17492.970 ms
(22 rows)
vs:
Without CTEs
EXPLAIN ANALYZE VERBOSE SELECT "gitlab_secondary"."projects"."id" FROM "gitlab_secondary"."projects" LEFT OUTER JOIN project_registry ON project_registry.project_id = gitlab_secondary.projects.id WHERE "project_registry"."project_id" IS NULL AND "gitlab_secondary"."projects"."repository_storage" = 'nfs-file01' AND (gitlab_secondary.projects.id > 4000000) ORDER BY "gitlab_secondary"."projects"."last_repository_updated_at" DESC LIMIT 1000
Limit (cost=275308.09..277591.04 rows=1000 width=12) (actual time=24202.978..24202.978 rows=0 loops=1)
Output: projects.id, projects.last_repository_updated_at
-> Nested Loop Anti Join (cost=275308.09..414423.91 rows=60937 width=12) (actual time=24202.978..24202.978 rows=0 loops=1)
Output: projects.id, projects.last_repository_updated_at
-> Foreign Scan on gitlab_secondary.projects (cost=275307.66..278049.83 rows=121874 width=12) (actual time=24202.899..24202.901 rows=2 loops=1)
Output: projects.id, projects.name, projects.path, projects.description, projects.created_at, projects.updated_at, projects.creator_id, projects.namespace_id, projects.last_activity_at, projects.import_url, projects.visibility_level, projects.archived, projects.import_status, projects.merge_requests_template, projects.star_count, projects.merge_requests_rebase_enabled, projects.import_type, projects.import_source, projects.avatar, projects.approvals_before_merge, projects.reset_approvals_on_push, projects.merge_requests_ff_only_enabled, projects.issues_template, projects.mirror, projects.mirror_last_update_at, projects.mirror_last_successful_update_at, projects.mirror_user_id, projects.import_error, projects.ci_id, projects.shared_runners_enabled, projects.runners_token, projects.build_coverage_regex, projects.build_allow_git_fetch, projects.build_timeout, projects.mirror_trigger_builds, projects.public_builds, projects.pending_delete, projects.last_repository_check_failed, projects.last_repository_check_at, projects.container_registry_enabled, projects.only_allow_merge_if_pipeline_succeeds, projects.has_external_issue_tracker, projects.repository_storage, projects.request_access_enabled, projects.has_external_wiki, projects.repository_read_only, projects.lfs_enabled, projects.description_html, projects.only_allow_merge_if_all_discussions_are_resolved, projects.repository_size_limit, projects.service_desk_enabled, projects.printing_merge_request_link_enabled, projects.auto_cancel_pending_pipelines, projects.import_jid, projects.cached_markdown_version, projects.last_repository_updated_at, projects.ci_config_path, projects.disable_overriding_approvers_per_merge_request, projects.delete_error, projects.storage_version, projects.resolve_outdated_diff_discussions, projects.remote_mirror_available_overridden, projects.only_mirror_protected_branches, projects.pull_mirror_available_overridden, projects.jobs_cache_index, projects.external_authorization_classification_label, projects.mirror_overwrites_diverged_branches, projects.external_webhook_token, projects.pages_https_only
Remote SQL: SELECT id, last_repository_updated_at FROM public.projects WHERE ((id > 4000000)) AND ((repository_storage = 'nfs-file01'::text)) ORDER BY last_repository_updated_at DESC NULLS FIRST
-> Index Only Scan using index_project_registry_on_project_id on public.project_registry (cost=0.43..1.25 rows=1 width=4) (actual time=0.034..0.034 rows=1 loops=2)
Output: project_registry.project_id
Index Cond: (project_registry.project_id = projects.id)
Heap Fetches: 14
Planning time: 3.275 ms
Execution time: 24203.415 ms
(13 rows)
Without using CTEs, adding project_id
clauses as described in https://gitlab.com/gitlab-org/gitlab-ee/issues/5731 doesn't help performance.
Edited by Stan Hu