Add worker to aggregate last 30-day catalog resource usage data
What does this MR do and why?
Context:
In !151623 (merged), we introduced the `Usages::Aggregator` class. Its purpose is to aggregate CI component usage data in time-boxed batches, which allows us to avoid the performance problems that come with processing very large datasets. With this approach, we can run the aggregator in a scheduled cron job; on each run, the aggregator resumes processing from where it last left off.
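As a rough illustration of the resume-from-cursor pattern (a hypothetical simplification in plain Ruby; the real `Usages::Aggregator` lives in the GitLab codebase and uses Redis for cursor storage):

```ruby
# Minimal sketch of cursor-based batch aggregation: each run processes one
# bounded batch, then persists a cursor so the next scheduled run resumes
# where this one left off. Names and storage here are illustrative only.
class BatchAggregator
  def initialize(records, batch_size:, cursor_store:)
    @records = records           # ordered records keyed by :id
    @batch_size = batch_size
    @cursor_store = cursor_store # Redis in production; a plain Hash here
  end

  def run
    start_id = @cursor_store.fetch(:cursor, 0)
    batch = @records.select { |r| r[:id] >= start_id }.first(@batch_size)
    return :done if batch.empty?

    batch.each { |r| r[:count] += 1 } # stand-in for the real aggregation work
    @cursor_store[:cursor] = batch.last[:id] + 1
    :continue
  end
end

records = (1..5).map { |i| { id: i, count: 0 } }
store = {}
aggregator = BatchAggregator.new(records, batch_size: 2, cursor_store: store)
aggregator.run # processes ids 1-2, saves cursor; later runs resume from id 3
```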
This MR:
- Utilizes the `Usages::Aggregator` class in a new service that aggregates the last 30-day usage count for each catalog resource.
- Runs the service in a cron job worker that is scheduled every 4 minutes.
- Saves the usage counts in `catalog_resources.last_30_day_usage_count`, along with the `last_30_day_usage_count_updated_at` timestamp.
- Includes some minor refactoring of the Aggregator/Cursor classes and specs for clarity.
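For reference, GitLab cron workers are typically wired up through `config/initializers/1_settings.rb`. A sketch of what the 4-minute schedule could look like (the setting key here is an assumption based on the worker name, not copied from the diff):

```ruby
# config/initializers/1_settings.rb (illustrative entry; key name is an assumption)
Settings.cron_jobs['ci_catalog_resources_aggregate_last30_day_usage_worker'] ||= {}
Settings.cron_jobs['ci_catalog_resources_aggregate_last30_day_usage_worker']['cron'] ||= '*/4 * * * *'
Settings.cron_jobs['ci_catalog_resources_aggregate_last30_day_usage_worker']['job_class'] = 'Ci::Catalog::Resources::AggregateLast30DayUsageWorker'
```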
Database indexes:
- This MR adds database indexes to help optimize the queries executed in the aggregation process.
- Each query currently executes in under 100ms on Prod even without the indexes, so we can add them in post-deploy migrations; they're non-critical to the feature at this time.
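A post-deploy migration for one of these indexes could look roughly like the following (a sketch using GitLab's standard `add_concurrent_index` helper; the class and index names are illustrative, not taken from the diff):

```ruby
# db/post_migrate/..._add_index_on_catalog_resources_usage_updated_at.rb (illustrative)
class AddIndexOnCatalogResourcesUsageCountUpdatedAt < Gitlab::Database::Migration[2.2]
  disable_ddl_transaction!

  INDEX_NAME = 'index_catalog_resources_on_last_30_day_usage_count_updated_at'

  def up
    add_concurrent_index :catalog_resources, :last_30_day_usage_count_updated_at, name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :catalog_resources, INDEX_NAME
  end
end
```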
Resolves Step 2 of #452545 (closed).
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
How to set up and validate locally
- Create 3 catalog resources:
bundle exec rake "gitlab:seed:ci_catalog_resources[<YOUR-GROUP-PATH>, 3, true]"
- Create a version for each catalog resource:
project_ids = Project.last(3).pluck(:id) # These should be the project IDs from the first step
user = User.first # This should be your root user
Project.where(id: project_ids).each do |project|
Releases::CreateService.new(project, user, tag: '1.0.0', ref: 'main', legacy_catalog_publish: true).execute
end
- For each catalog resource, create usages within the last 30 days:
expected_ordered_usage_counts = [12, 1, 7]
resources = Ci::Catalog::Resource.where(project_id: project_ids).order(:id)
resources.each_with_index do |resource, i|
usage_count = expected_ordered_usage_counts[i]
component = resource.components.first
(1..usage_count).each do |k|
    Ci::Catalog::Resources::Components::Usage.create!(
      component: component,
      catalog_resource: component.catalog_resource,
      project: component.project,
      used_by_project_id: k,
      used_date: Date.today - k.days
    )
end
end
- Reset the usage counts and run the worker.
lease_key = Ci::Catalog::Resources::AggregateLast30DayUsageService.name
# Start with a fresh cursor
Gitlab::Redis::SharedState.with do |redis|
redis.del("#{lease_key}:cursor")
end
# Reset the usage count values
Ci::Catalog::Resource.update_all(last_30_day_usage_count: 0, last_30_day_usage_count_updated_at: Date.today - 1.day)
# Run the worker manually, or wait for the next scheduled run (every 4 minutes)
Ci::Catalog::Resources::AggregateLast30DayUsageWorker.perform_async
- Observe that the saved `last_30_day_usage_count` values match the expected usage counts:
ordered_usage_counts = resources.order(:id).pluck(:last_30_day_usage_count)
ordered_usage_counts == expected_ordered_usage_counts
Migrations
Query plans
1. `min` query on `catalog_resources`

- There are currently ~2800 catalog resource records on Prod.
- Index: The query plans show that it is beneficial to add an index on `last_30_day_usage_count_updated_at`.
SELECT MIN("catalog_resources"."last_30_day_usage_count_updated_at")
FROM "catalog_resources"
- Query plan without index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28961/commands/90245
- Query plan with index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28961/commands/90248
2. `each_batch` queries on `catalog_resources`

SELECT "catalog_resources"."id"
FROM "catalog_resources"
WHERE (id >= 0)
ORDER BY "catalog_resources"."id" ASC LIMIT 1
SELECT "catalog_resources"."id"
FROM "catalog_resources"
WHERE (id >= 0)
AND "catalog_resources"."id" >= 5
ORDER BY "catalog_resources"."id" ASC
LIMIT 1 OFFSET 1000
SELECT "catalog_resources".*
FROM "catalog_resources"
WHERE (id >= 0)
AND "catalog_resources"."id" >= 1003000 AND "catalog_resources"."id" < 1005700
ORDER BY "catalog_resources"."id" ASC
3. `distinct_each_batch` queries on `p_catalog_resource_component_usages`

- There are currently ~45k usage records on Prod.
- `p_catalog_resource_component_usages` is partitioned by month. Since the usage date range is a rolling 30-day window, it will cross at most two partitions.
- Index: The query plans show that it is beneficial to add a composite index on `catalog_resource_id`, `used_by_project_id`, and `used_date`.
SELECT "p_catalog_resource_component_usages"."used_by_project_id"
FROM "p_catalog_resource_component_usages"
WHERE "p_catalog_resource_component_usages"."catalog_resource_id" = 1001628
AND "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-04-25' AND '2024-05-24'
AND (used_by_project_id > 0)
ORDER BY "p_catalog_resource_component_usages"."used_by_project_id" ASC
LIMIT 1;
- Query plan without index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28727/commands/89647
- Query plan with index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28727/commands/89657
WITH RECURSIVE "loose_index_scan_cte" AS (
(
SELECT "p_catalog_resource_component_usages"."used_by_project_id"
FROM "p_catalog_resource_component_usages"
WHERE "p_catalog_resource_component_usages"."catalog_resource_id" = 1001628
AND "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-04-25' AND '2024-05-24'
AND (used_by_project_id > 0)
AND "p_catalog_resource_component_usages"."used_by_project_id" >= 1
ORDER BY "p_catalog_resource_component_usages"."used_by_project_id" ASC
LIMIT 1
)
UNION
(
SELECT (
SELECT "p_catalog_resource_component_usages"."used_by_project_id"
FROM "p_catalog_resource_component_usages"
WHERE "p_catalog_resource_component_usages"."catalog_resource_id" = 1001628
AND "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-04-25' AND '2024-05-24'
AND (used_by_project_id > 0)
AND "p_catalog_resource_component_usages"."used_by_project_id" > "loose_index_scan_cte"."used_by_project_id"
ORDER BY "p_catalog_resource_component_usages"."used_by_project_id" ASC LIMIT 1
) AS used_by_project_id
FROM "loose_index_scan_cte"
)
)
SELECT "used_by_project_id"
FROM "loose_index_scan_cte" AS "p_catalog_resource_component_usages"
WHERE "p_catalog_resource_component_usages"."used_by_project_id" IS NOT NULL
LIMIT 1 OFFSET 100;
- Query plan without index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28760/commands/89721
- Query plan with index: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28760/commands/89728
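The "at most two partitions" point follows from the window length: a rolling 30-day range can span at most two calendar months. A quick self-contained check (plain Ruby, not GitLab code):

```ruby
require 'date'

# Returns the set of month partitions ("YYYY-MM") touched by an inclusive
# date range, mirroring the monthly partitioning of the usages table.
def partitions_for(from, to)
  (from..to).map { |d| d.strftime('%Y-%m') }.uniq
end

# The 30-day window from the query above covers exactly two partitions.
window_end = Date.new(2024, 5, 24)
window_start = window_end - 29
partitions_for(window_start, window_end) # ["2024-04", "2024-05"]
```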
4. `update_all` query on `catalog_resources.last_30_day_usage_count_updated_at`

UPDATE "catalog_resources"
SET "last_30_day_usage_count_updated_at" = '2024-06-02 04:26:51'
WHERE (id >= 1003000 AND id <= 1005700); -- Real query uses an IN statement with 1000 ids in a batch
- Query plan: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28727/commands/89661
- Not sure if there is much we can do to optimize this query. Note that it will only ever update a maximum batch size of `1000` catalog resources.
Related to #452545 (closed)