Determine partitioning strategy for CI/CD "active" data
Description
We are slowly starting to work on partitioning CI/CD data. We need to review our technical approach and strategy towards splitting CI/CD tables into many partitions. There are two competing solutions, and we need to make a well informed decision about which way we want to go. This probably needs some experiments that will allow us to make decision in a bit more data-informed way.
The two solutions are: introduce new partitioned schema and store "active" data in it vs introduce new partitioned schema and archive CI/CD data into it using retention policy. Initially we thought that the latter is going to be easier to introduce, but recently @NikolayS suggested that we should reconsider the former strategy as well, because it makes it possible to avoid migrating data (what is known to be tricky as well).
The two solutions are a bit different and have different pros and cons. This can be summarized to: with partitioning the active data set, we will need to update all SQL queries targeting ci_*
tables and make sure that there is a partitioning key included. This will most likely require extensive changes in the API and URLs. With partitioning the archived data only, we avoid the need of updating all the queries, but the real cost here is migrating millions of rows each day into the archived schema.
The question we need to find an answer for is: can we somehow combine the two strategies to get most of each one? Can we introduce a new partitioned schema that will also store "active" data, but still make most of the time-decay characteristics while avoiding the huge amount of work that a daily reconciliation worker would need to perform?
Outcome
The ideal outcome for this issue is a PoC that could demonstrate how can we extend ci_*
tables with a new, partitioned, schema, in a way that it is not necessary to rebuild a ton of SQL queries in many places (localized changes, perhaps on a different level of abstraction).