Follow-up from "Add worker to aggregate last 30-day catalog resource usage data"

Summary

The following discussion from !155001 (merged) should be addressed:

  • @ahegyi started a discussion: (+1 comment)

    @lma-git, thanks! I ran the worker locally with a PRD replica and waited until it finished (it took a few minutes). Let me know if you want to see the invoked queries; I can send over the logs.

    I have two concerns, which we can address as a follow-up:

    1: Lots of queries use distinct_each_batch, for example:

      Ci::Catalog::Resources::Components::Usage Load (216.8ms)

      SELECT "p_catalog_resource_component_usages"."used_by_project_id"
      FROM "p_catalog_resource_component_usages"
      WHERE "p_catalog_resource_component_usages"."catalog_resource_id" = 1001657 AND
        "p_catalog_resource_component_usages"."used_date" BETWEEN '2024-05-08' AND '2024-06-06' AND
        (used_by_project_id > 0)
      ORDER BY "p_catalog_resource_component_usages"."used_by_project_id" ASC LIMIT 1
      /*application:console,db_config_name:main,console_hostname:gitlab,console_username:ahegyi,line:/app/models/concerns/each_batch.rb:121:in `distinct_each_batch'*/
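
    The query above is the keyset-pagination step that finds the next distinct value boundary. As a rough illustration (not GitLab's actual implementation, which lives in app/models/concerns/each_batch.rb), the iteration pattern is: walk the sorted distinct values and yield them in fixed-size batches. A minimal pure-Ruby sketch of that semantics:

```ruby
# Illustrative simulation of distinct_each_batch semantics: iterate the
# DISTINCT values of a column in sorted order, yielding fixed-size
# batches. The real implementation issues one SQL query per boundary
# (as in the log excerpt above) instead of materializing all values.
def distinct_each_batch_sim(values, of:)
  distinct = values.uniq.sort
  start = 0
  until start >= distinct.size
    yield distinct[start, of]
    start += of
  end
end

batches = []
distinct_each_batch_sim([3, 1, 2, 2, 5, 3], of: 2) { |b| batches << b }
# batches == [[1, 2], [3, 5]]
```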

    We could implement some sort of preloader query that checks whether at least one row exists for the given catalog_resource_id with the given filter parameters.
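
    One way to read the preloader idea: a single pass up front collects which catalog resources have any usage rows in the window, so the per-resource batching loop can skip empty resources without issuing a query each. The sketch below is hypothetical; the Row struct and method name are illustrative, not the actual schema or API.

```ruby
require 'date'
require 'set'

# Hypothetical in-memory stand-in for a usage row; in the real worker
# this data lives in p_catalog_resource_component_usages.
Row = Struct.new(:catalog_resource_id, :used_date, :used_by_project_id)

# Build the set of catalog_resource_ids that have at least one row in
# the [from, to] window, in one pass, so empty resources can be skipped.
def resources_with_usage(rows, from:, to:)
  rows.each_with_object(Set.new) do |row, ids|
    ids << row.catalog_resource_id if (from..to).cover?(row.used_date)
  end
end

rows = [
  Row.new(1, Date.new(2024, 5, 10), 7),
  Row.new(2, Date.new(2024, 1, 1), 8)
]
ids = resources_with_usage(rows, from: Date.new(2024, 5, 8), to: Date.new(2024, 6, 6))
# ids contains only 1; resource 2 has no rows in the window
```

    In SQL terms this would be an EXISTS-style check batched over resource ids rather than one boundary query per resource.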

    2: save_usage_counts! inserts an unbounded number of rows. We might want to add an extra condition to the Aggregator so that it stops after collecting N usage counts. --> Already bounded: it currently updates at most 1000 records per query, running save_usage_counts! once for each TARGET_BATCH_SIZE batch it yields.
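
    The bounded write pattern described in the resolution can be sketched as follows. This is an illustration, not the worker's actual code; the method name and the flush stub are assumptions, while TARGET_BATCH_SIZE = 1000 matches the "at most 1000 records per query" noted above.

```ruby
# Flush usage counts in fixed-size slices so that no single write
# statement touches more than TARGET_BATCH_SIZE rows.
TARGET_BATCH_SIZE = 1000

def save_usage_counts_in_batches(usage_counts, &flush)
  # The flush block stands in for save_usage_counts!, which would
  # upsert one slice per call in the real worker.
  usage_counts.each_slice(TARGET_BATCH_SIZE, &flush)
end

sizes = []
save_usage_counts_in_batches((1..2500).to_a) { |batch| sizes << batch.size }
# sizes == [1000, 1000, 500] -- three bounded writes instead of one unbounded one
```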

Edited by Leaminn Ma