Backend: Reduce data retention period and store aggregate data in long term table
Summary
In #440382 (closed), we implemented component usage instrumentation that tracks when a project includes a component in a pipeline. The data is recorded in p_catalog_resource_component_usages
with a daily breakdown.
The problem with keeping daily usage data for all time is that it will eventually cause performance and bloat issues on the table. The table is currently partitioned which will delay this problem, but it's not ideal to keep this granular data for long term. In !145881 (merged), we set the retention period to be 12 months, but we should shorten it to 2-3 months.
Since we may support an internal component usage analytics dashboard in the future (#440382 (comment 1793005911)), we should:
- Aggregate the data on a monthly basis and then store it into a new table (also partitioned) for long term historical data storage.
- Update the model logic for
p_catalog_resource_component_usages
so that it drops partitions containing old data ("old" meaning data we've already aggregated and stored in the long term table.)
Context
Ref: #440382 (comment 1784043885)
If we only need to calculate a 30 day report then perhaps we can drop the records after some threshold. Perhaps we always keep the last 60 days.
There can also be multiple levels of aggregation if we want to keep coarse metrics for a longer window of time. Every 30 days you could summarise all the data into counts which are stored in another table that is kept for a much longer period of time.
Proposal
Implement a worker that aggregates the data (with the necessary breakdowns) every month and record it into a new partitioned table for long term storage.
Update the Ci::Catalog::Resources::Components::Usage
model so that it drops partitions with data older than 2-3 months.
- If we decide to not proceed with the internal usage analytics dashboard feature, then we just need to do this step.