Add worker to aggregate last 30-day catalog resource usage data
What does this MR do and why?
Context:
In !151623 (merged), we introduced the `Usages::Aggregator` class. Its purpose is to aggregate CI component usage data in time-boxed batches, which allows us to avoid the performance problems that come with processing very large datasets. With this approach, we can run the aggregator in a scheduled cron job; on each run, the aggregator resumes processing from where it last left off.
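As a rough illustration of the resume-from-cursor pattern (a hypothetical simplification in plain Ruby; the real `Usages::Aggregator` lives in the GitLab codebase and uses Redis for cursor storage):

```ruby
# Minimal sketch of cursor-based batch aggregation: each run processes one
# bounded batch, then persists a cursor so the next scheduled run resumes
# where this one left off. Names and storage here are illustrative only.
class BatchAggregator
  def initialize(records, batch_size:, cursor_store:)
    @records = records           # ordered records keyed by :id
    @batch_size = batch_size
    @cursor_store = cursor_store # Redis in production; a plain Hash here
  end

  def run
    start_id = @cursor_store.fetch(:cursor, 0)
    batch = @records.select { |r| r[:id] >= start_id }.first(@batch_size)
    return :done if batch.empty?

    batch.each { |r| r[:count] += 1 } # stand-in for the real aggregation work
    @cursor_store[:cursor] = batch.last[:id] + 1
    :continue
  end
end

records = (1..5).map { |i| { id: i, count: 0 } }
store = {}
aggregator = BatchAggregator.new(records, batch_size: 2, cursor_store: store)
aggregator.run # processes ids 1-2, saves cursor; later runs resume from id 3
```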
This MR:
- Utilizes the `Usages::Aggregator` class in a new service that aggregates the last 30-day usage count for each catalog resource.
- Runs the service in a cron job worker that is scheduled every 4 minutes.
- Saves the usage counts in `catalog_resources.last_30_day_usage_count`, along with the `last_30_day_usage_count_updated_at` timestamp.
- Includes some minor refactoring of the Aggregator/Cursor classes and specs for clarity.
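For reference, GitLab cron workers are typically wired up through `config/initializers/1_settings.rb`. A sketch of what the 4-minute schedule could look like (the setting key here is an assumption based on the worker name, not copied from the diff):

```ruby
# config/initializers/1_settings.rb (illustrative entry; key name is an assumption)
Settings.cron_jobs['ci_catalog_resources_aggregate_last30_day_usage_worker'] ||= {}
Settings.cron_jobs['ci_catalog_resources_aggregate_last30_day_usage_worker']['cron'] ||= '*/4 * * * *'
Settings.cron_jobs['ci_catalog_resources_aggregate_last30_day_usage_worker']['job_class'] = 'Ci::Catalog::Resources::AggregateLast30DayUsageWorker'
```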
Database indexes:
- This MR adds database indexes to help optimize the queries executed in the aggregation process.
- Each query currently executes in under 100ms on Prod even without the indexes, so we can add them in post-deploy migrations; they're non-critical to the feature at this time.
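A post-deploy migration for one of these indexes could look roughly like the following (a sketch using GitLab's standard `add_concurrent_index` helper; the class and index names are illustrative, not taken from the diff):

```ruby
# db/post_migrate/..._add_index_on_catalog_resources_usage_updated_at.rb (illustrative)
class AddIndexOnCatalogResourcesUsageCountUpdatedAt < Gitlab::Database::Migration[2.2]
  disable_ddl_transaction!

  INDEX_NAME = 'index_catalog_resources_on_last_30_day_usage_count_updated_at'

  def up
    add_concurrent_index :catalog_resources, :last_30_day_usage_count_updated_at, name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :catalog_resources, INDEX_NAME
  end
end
```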
Resolves Step 2 of #452545 (closed).
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
How to set up and validate locally
- Create 3 catalog resources:
bundle exec rake "gitlab:seed:ci_catalog_resources[<YOUR-GROUP-PATH>, 3, true]"
- Create a version for each catalog resource:
project_ids = Project.last(3).pluck(:id) # These should be the project IDs from the first step
user = User.first # This should be your root user
Project.where(id: project_ids).each do |project|
Releases::CreateService.new(project, user, tag: '1.0.0', ref: 'main', legacy_catalog_publish: true).execute
end
- For each catalog resource, create usages within the last 30 days:
expected_ordered_usage_counts = [12, 1, 7]
resources = Ci::Catalog::Resource.where(project_id: project_ids).order(:id)
resources.each_with_index do |resource, i|
usage_count = expected_ordered_usage_counts[i]
component = resource.components.first
(1..usage_count).each do |k|
    Ci::Catalog::Resources::Components::Usage.create!(
      component: component,
      catalog_resource: component.catalog_resource,
      project: component.project,
      used_by_project_id: k,
      used_date: Date.today - k.days
    )
end
end
- Reset the usage counts and run the worker.
lease_key = Ci::Catalog::Resources::AggregateLast30DayUsageService.name
# Start with a fresh cursor
Gitlab::Redis::SharedState.with do |redis|
redis.del("#{lease_key}:cursor")
end
# Reset the usage count values
Ci::Catalog::Resource.update_all(last_30_day_usage_count: 0, last_30_day_usage_count_updated_at: Date.today - 1.day)
# Run the worker manually, or wait for the next scheduled run (every 4 minutes)
Ci::Catalog::Resources::AggregateLast30DayUsageWorker.perform_async
- Observe that the saved `last_30_day_usage_count` values match the expected usage counts:
ordered_usage_counts = resources.order(:id).pluck(:last_30_day_usage_count)
ordered_usage_counts == expected_ordered_usage_counts
Migrations
Query plans
1. `min` query on `catalog_resources`

- There are currently ~2800 catalog resource records on Prod.
- Index: The query plans show that it is beneficial to add an index on `last_30_day_usage_count_updated_at`.
SELECT MIN("catalog_resources"."last_30_day_usage_count_updated_at")
FROM "catalog_resources"
- Query plan without index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28961/commands/90245
- Query plan with index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28961/commands/90248
2. `each_batch` queries on `catalog_resources`

SELECT "catalog_resources"."id"
FROM "catalog_resources"
WHERE (id >= 0)
ORDER BY "catalog_resources"."id" ASC LIMIT 1
SELECT "catalog_resources"."id"
FROM "catalog_resources"
WHERE (id >= 0)
AND "catalog_resources"."id" >= 5
ORDER BY "catalog_resources"."id" ASC
LIMIT 1 OFFSET 1000
SELECT "catalog_resources".*
FROM "catalog_resources"
WHERE (id >= 0)
AND "catalog_resources"."id" >= 1003000 AND "catalog_resources"."id" < 1005700
ORDER BY "catalog_resources"."id" ASC
3. `distinct_each_batch` queries on `p_catalog_resource_component_usages`

- There are currently ~45k usage records on Prod.
- `p_catalog_resource_component_usages` is partitioned by month. Since the usage date range is a rolling 30-day window, it will cross at most two partitions.
- Index: The query plans show that it is beneficial to add a composite index on `catalog_resource_id`, `used_by_project_id`, and `used_date`.
SELECT "p_catalog_resource_component_usages"."used_by_project_id"
FROM "p_catalog_resource_component_usages"
WHERE "p_catalog_resource_component_usages"."catalog_resource_id" = 1001628
AND "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-04-25' AND '2024-05-24'
AND (used_by_project_id > 0)
ORDER BY "p_catalog_resource_component_usages"."used_by_project_id" ASC
LIMIT 1;
- Query plan without index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28727/commands/89647
- Query plan with index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28727/commands/89657
WITH RECURSIVE "loose_index_scan_cte" AS (
(
SELECT "p_catalog_resource_component_usages"."used_by_project_id"
FROM "p_catalog_resource_component_usages"
WHERE "p_catalog_resource_component_usages"."catalog_resource_id" = 1001628
AND "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-04-25' AND '2024-05-24'
AND (used_by_project_id > 0)
AND "p_catalog_resource_component_usages"."used_by_project_id" >= 1
ORDER BY "p_catalog_resource_component_usages"."used_by_project_id" ASC
LIMIT 1
)
UNION
(
SELECT (
SELECT "p_catalog_resource_component_usages"."used_by_project_id"
FROM "p_catalog_resource_component_usages"
WHERE "p_catalog_resource_component_usages"."catalog_resource_id" = 1001628
AND "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-04-25' AND '2024-05-24'
AND (used_by_project_id > 0)
AND "p_catalog_resource_component_usages"."used_by_project_id" > "loose_index_scan_cte"."used_by_project_id"
ORDER BY "p_catalog_resource_component_usages"."used_by_project_id" ASC LIMIT 1
) AS used_by_project_id
FROM "loose_index_scan_cte"
)
)
SELECT "used_by_project_id"
FROM "loose_index_scan_cte" AS "p_catalog_resource_component_usages"
WHERE "p_catalog_resource_component_usages"."used_by_project_id" IS NOT NULL
LIMIT 1 OFFSET 100;
- Query plan without index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28760/commands/89721
- Query plan with index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28760/commands/89728
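The "at most two partitions" point follows from the window length: a rolling 30-day range can span at most two calendar months. A quick self-contained check (plain Ruby, not GitLab code):

```ruby
require 'date'

# Returns the set of month partitions ("YYYY-MM") touched by an inclusive
# date range, mirroring the monthly partitioning of the usages table.
def partitions_for(from, to)
  (from..to).map { |d| d.strftime('%Y-%m') }.uniq
end

# The 30-day window from the query above covers exactly two partitions.
window_end = Date.new(2024, 5, 24)
window_start = window_end - 29
partitions_for(window_start, window_end) # ["2024-04", "2024-05"]
```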
4. `update_all` query on `catalog_resources.last_30_day_usage_count_updated_at`

UPDATE "catalog_resources"
SET "last_30_day_usage_count_updated_at" = '2024-06-02 04:26:51'
WHERE (id >= 1003000 AND id <= 1005700); -- Real query uses an IN statement with 1000 ids in a batch
- Query plan: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28727/commands/89661
- Not sure if there is much we can do to optimize this query. Note that it will only ever update a maximum batch size of `1000` catalog resources.
Related to #452545 (closed)