Follow-up from "Add worker to aggregate last 30-day catalog resource usage data"
Summary
The following discussion from !155001 (merged) should be addressed:
@ahegyi started a discussion: (+1 comment) @lma-git, thanks! I ran the worker locally with a PRD replica and waited until it finished (took a few minutes). Let me know if you want to see the invoked queries, I can send over the logs.
I have two concerns which we can address as a follow-up:
1. A lot of queries with `distinct_each_batch`, for example:

   ```sql
   Ci::Catalog::Resources::Components::Usage Load (216.8ms)
   SELECT "p_catalog_resource_component_usages"."used_by_project_id"
   FROM "p_catalog_resource_component_usages"
   WHERE "p_catalog_resource_component_usages"."catalog_resource_id" = 1001657
     AND "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-05-08' AND '2024-06-06'
     AND (used_by_project_id > 0)
   ORDER BY "p_catalog_resource_component_usages"."used_by_project_id" ASC
   LIMIT 1
   /*application:console,db_config_name:main,console_hostname:gitlab,console_username:ahegyi,line:/app/models/concerns/each_batch.rb:121:in `distinct_each_batch'*/
   ```

   We could implement some sort of preloader query which checks whether there is at least one row for the given `catalog_resource_id` with the given filter parameters.
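One possible shape for such a preloader (a sketch, not an actual implementation): a single round trip that returns only the catalog resource ids that have any usage rows in the 30-day window, so the per-resource `distinct_each_batch` loops can skip empty resources entirely. The table, columns, and dates below are taken from the query in the log; the id list is illustrative.

```sql
-- Hypothetical preloader: one query for a batch of candidate resources,
-- instead of one LIMIT 1 probe per catalog_resource_id.
SELECT DISTINCT "p_catalog_resource_component_usages"."catalog_resource_id"
FROM "p_catalog_resource_component_usages"
WHERE "p_catalog_resource_component_usages"."catalog_resource_id" IN (1001657 /* , ... other candidate ids */)
  AND "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-05-08' AND '2024-06-06'
  AND (used_by_project_id > 0);
```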
catalog_resource_idwith the given filter parameters.2:--> Already bounded. Currently, it updates at most 1000 records per query. It runssave_usage_counts!inserts unbounded number of rows. We might want to add an extra condition to the Aggregator to stop after collected N usage counts.save_usage_countsonce for eachTARGET_BATCH_SIZEit yields.