
Add worker to aggregate last 30-day catalog resource usage data

What does this MR do and why?

Context:

In !151623 (merged), we introduced the Usages::Aggregator class. Its purpose is to aggregate CI component usage data in time-boxed batches, which allows us to avoid the performance problems that come with processing very large datasets. With this approach, we can run the aggregator in a scheduled cron job; on each run, the aggregator resumes processing from where it last left off.
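For intuition, here is a minimal sketch of that resumable, time-boxed pattern. It is illustrative only and not the actual Usages::Aggregator API; cursor_id, aggregate_usage_for, and the time budget are hypothetical placeholders:

deadline = 50.seconds.from_now  # hypothetical per-run time budget

Ci::Catalog::Resource.where('id >= ?', cursor_id).each_batch(of: 1000) do |batch|
  aggregate_usage_for(batch)      # hypothetical per-batch aggregation work
  cursor_id = batch.maximum(:id)  # persist this (e.g. in Redis) so the next run resumes here
  break if Time.current >= deadline
end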

This MR:

  • Utilizes the Usages::Aggregator class in a new service that aggregates the last 30-day usage count for each catalog resource (a hedged sketch of the worker wiring follows this list).
  • Runs the service in a cron job worker scheduled every 4 minutes.
  • Saves the usage counts to catalog_resources.last_30_day_usage_count, along with the last_30_day_usage_count_updated_at timestamp.
  • Includes some minor refactoring of the Aggregator/Cursor classes and their specs for clarity.
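For orientation, a hedged sketch of the worker wiring (simplified; the exact class bodies and the service's constructor are in the diff, and the every-4-minutes schedule would be the cron expression "*/4 * * * *" in config/initializers/1_settings.rb):

module Ci
  module Catalog
    module Resources
      class AggregateLast30DayUsageWorker
        include ApplicationWorker

        idempotent!

        # Triggered by the cron schedule; delegates to the aggregation service.
        def perform
          Ci::Catalog::Resources::AggregateLast30DayUsageService.new.execute
        end
      end
    end
  end
end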

Database indexes:

  • This MR adds database indexes to help optimize the queries executed in the aggregation process.
  • Each of these queries currently executes in under 100ms on Prod even without the indexes, so the indexes can be added in post-deploy migrations; they are not critical to the feature at this time. A hedged sketch of one such migration follows this list.
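To illustrate, a minimal sketch of what the catalog_resources post-deploy migration could look like (class and index names are illustrative; the composite index on the partitioned table would use the corresponding partitioned-index helpers instead):

class AddIndexToCatalogResourcesOnUsageCountUpdatedAt < Gitlab::Database::Migration[2.2]
  disable_ddl_transaction!

  INDEX_NAME = 'index_catalog_resources_on_last_30_day_usage_count_updated_at'

  def up
    add_concurrent_index :catalog_resources, :last_30_day_usage_count_updated_at, name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :catalog_resources, INDEX_NAME
  end
end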

Resolves Step 2 of #452545 (closed).

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

  1. Create 3 catalog resources:
bundle exec rake "gitlab:seed:ci_catalog_resources[<YOUR-GROUP-PATH>, 3, true]"
  2. Create a version for each catalog resource:
project_ids = Project.last(3).pluck(:id) # These should be the project IDs from the first step
user = User.first # This should be your root user

Project.where(id: project_ids).each do |project|
  Releases::CreateService.new(project, user, tag: '1.0.0', ref: 'main', legacy_catalog_publish: true).execute
end
  3. For each catalog resource, create usages within the last 30 days:
expected_ordered_usage_counts = [12, 1, 7]
resources = Ci::Catalog::Resource.where(project_id: project_ids).order(:id)

resources.each_with_index do |resource, i|
  usage_count = expected_ordered_usage_counts[i]
  component = resource.components.first

  (1..usage_count).each do |k|
    Ci::Catalog::Resources::Components::Usage.create!(
      component: component,
      catalog_resource: component.catalog_resource,
      project: component.project,
      used_by_project_id: k,
      used_date: Date.today - k.days
    )
  end
end
  4. Reset the usage counts and run the worker:
lease_key = Ci::Catalog::Resources::AggregateLast30DayUsageService.name

# Start with a fresh cursor
Gitlab::Redis::SharedState.with do |redis|
  redis.del("#{lease_key}:cursor")
end

# Reset the usage count values
Ci::Catalog::Resource.update_all(last_30_day_usage_count: 0, last_30_day_usage_count_updated_at: Date.today - 1.day)

# Run the worker manually, or wait for the next scheduled run (every 4 minutes)
Ci::Catalog::Resources::AggregateLast30DayUsageWorker.perform_async
  5. Observe that the saved last_30_day_usage_count values match the expected usage counts:
ordered_usage_counts = resources.order(:id).pluck(:last_30_day_usage_count)
ordered_usage_counts == expected_ordered_usage_counts
Screenshot: Screenshot_2024-06-01_at_7.14.47_PM

Migrations

up.txt

down.txt

Query plans

1. min query on catalog_resources

  • There are currently ~2800 catalog resource records on Prod.
  • Index: The query plans show that it is beneficial to add an index on last_30_day_usage_count_updated_at.
SELECT MIN("catalog_resources"."last_30_day_usage_count_updated_at")
FROM "catalog_resources"

Query plan without index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28961/commands/90245
Query plan with index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28961/commands/90248
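For reference, the MIN query above is what ActiveRecord emits for a simple minimum call, presumably used to find the oldest aggregation timestamp as the resume point:

# Generates: SELECT MIN("catalog_resources"."last_30_day_usage_count_updated_at") ...
Ci::Catalog::Resource.minimum(:last_30_day_usage_count_updated_at)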

2. each_batch queries on catalog_resources

SELECT "catalog_resources"."id"
FROM "catalog_resources"
WHERE (id >= 0)
ORDER BY "catalog_resources"."id" ASC LIMIT 1
SELECT "catalog_resources"."id"
FROM "catalog_resources"
WHERE (id >= 0)
AND "catalog_resources"."id" >= 5
ORDER BY "catalog_resources"."id" ASC
LIMIT 1 OFFSET 1000
SELECT "catalog_resources".*
FROM "catalog_resources"
WHERE (id >= 0)
AND "catalog_resources"."id" >= 1003000 AND "catalog_resources"."id" < 1005700
ORDER BY "catalog_resources"."id" ASC
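These are the standard bounding and range queries produced by an each_batch iteration; a hedged sketch of the producing call, where cursor_id is a placeholder for the resume position:

Ci::Catalog::Resource.where('id >= ?', cursor_id).each_batch(of: 1000) do |batch|
  # aggregate usage counts for each catalog resource in the batch
end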

3. distinct_each_batch queries on p_catalog_resource_component_usages

  • There are currently ~45k usage records on Prod.
  • p_catalog_resource_component_usages is partitioned by month. Since the usage date range is a rolling 30-day window, it will cross at most two partitions.
  • Index: The query plans show that it is beneficial to add a composite index on catalog_resource_id, used_by_project_id, and used_date.
SELECT "p_catalog_resource_component_usages"."used_by_project_id"
FROM "p_catalog_resource_component_usages"
WHERE "p_catalog_resource_component_usages"."catalog_resource_id" = 1001628
AND "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-04-25' AND '2024-05-24'
AND (used_by_project_id > 0) 
ORDER BY "p_catalog_resource_component_usages"."used_by_project_id" ASC
LIMIT 1;
WITH RECURSIVE "loose_index_scan_cte" AS (
  (
    SELECT "p_catalog_resource_component_usages"."used_by_project_id"
    FROM "p_catalog_resource_component_usages"
    WHERE "p_catalog_resource_component_usages"."catalog_resource_id" = 1001628
    AND "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-04-25' AND '2024-05-24'
    AND (used_by_project_id > 0)
    AND "p_catalog_resource_component_usages"."used_by_project_id" >= 1
    ORDER BY "p_catalog_resource_component_usages"."used_by_project_id" ASC
    LIMIT 1
  )
  UNION
  (
    SELECT (
      SELECT "p_catalog_resource_component_usages"."used_by_project_id"
      FROM "p_catalog_resource_component_usages"
      WHERE "p_catalog_resource_component_usages"."catalog_resource_id" = 1001628
      AND "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-04-25' AND '2024-05-24'
      AND (used_by_project_id > 0)
      AND "p_catalog_resource_component_usages"."used_by_project_id" > "loose_index_scan_cte"."used_by_project_id"
      ORDER BY "p_catalog_resource_component_usages"."used_by_project_id" ASC LIMIT 1
    ) AS used_by_project_id
    FROM "loose_index_scan_cte"
  )
)
SELECT "used_by_project_id"
FROM "loose_index_scan_cte" AS "p_catalog_resource_component_usages"
WHERE "p_catalog_resource_component_usages"."used_by_project_id" IS NOT NULL
LIMIT 1 OFFSET 100;
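The recursive CTE above is the loose index scan that distinct_each_batch emits. A hedged sketch of the producing call, assuming the model includes the EachBatch concern (resource, the window, and the batch size are illustrative):

# Count distinct using projects per catalog resource over the rolling window.
window = 30.days.ago.to_date..1.day.ago.to_date

usage_count = 0
Ci::Catalog::Resources::Components::Usage
  .where(catalog_resource_id: resource.id, used_date: window)
  .distinct_each_batch(column: :used_by_project_id, of: 100) do |batch|
    usage_count += batch.count
  end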

4. update_all query on catalog_resources.last_30_day_usage_count_updated_at

UPDATE "catalog_resources"
SET "last_30_day_usage_count_updated_at" = '2024-06-02 04:26:51'
WHERE (id >= 1003000 AND id <= 1005700); -- Real query uses an IN statement with 1000 ids in a batch
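In Rails terms this corresponds to roughly the following, where batch_ids is a placeholder for the current batch of catalog resource IDs:

# Stamps the batch as freshly aggregated; the real query uses an IN list
# of up to 1000 ids rather than a range.
Ci::Catalog::Resource
  .where(id: batch_ids)
  .update_all(last_30_day_usage_count_updated_at: Time.current)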

Related to #452545 (closed)
