Skip to content

Backend: Reduce data retention period and store aggregate data in long term table

Summary

In #440382 (closed), we implemented component usage instrumentation that tracks when a project includes a component in a pipeline. The data is recorded in p_catalog_resource_component_usages with a daily breakdown.

The problem with keeping daily usage data for all time is that it will eventually cause performance and bloat issues on the table. The table is currently partitioned which will delay this problem, but it's not ideal to keep this granular data for long term. In !145881 (merged), we set the retention period to be 12 months, but we should shorten it to 2-3 months.

Since we may support an internal component usage analytics dashboard in the future (#440382 (comment 1793005911)), we should:

  1. Aggregate the data on a monthly basis and then store it into a new table (also partitioned) for long term historical data storage.
  2. Update the model logic for p_catalog_resource_component_usages so that it drops partitions containing old data ("old" meaning data we've already aggregated and stored in the long term table.)

Context

Ref: #440382 (comment 1784043885)

If we only need to calculate a 30 day report then perhaps we can drop the records after some threshold. Perhaps we always keep the last 60 days.

There can also be multiple levels of aggregation if we want to keep coarse metrics for a longer window of time. Every 30 days you could summarise all the data into counts which are stored in another table that is kept for a much longer period of time.

Proposal

Implement a worker that aggregates the data (with the necessary breakdowns) every month and record it into a new partitioned table for long term storage.

Update the Ci::Catalog::Resources::Components::Usage model so that it drops partitions with data older than 2-3 months.

  • If we decide to not proceed with the internal usage analytics dashboard feature, then we just need to do this step.

Implementation Table

Group Issue Link
backend Spike: Evaluate how to calculate number of time... (#438409 - closed)
backend Backend: Component usage instrumentation (#440382 - closed)
backend Backend: Implement daily worker to aggregate la... (#452545 - closed)
backend Backend: Add option to sort CI catalog resource... (#452620 - closed)
frontend Frontend: Show usage statistics and sort option... (#434333 - closed)
frontend Frontend: Present the number of times a single ... (#443632 - closed)
backend 👈 You are here
backend Backend: Implement individual component sorting... (#453975 - closed)
Edited by 🤖 GitLab Bot 🤖