The taggings is one of the top 10 most large tables in the CI database. In it's current form we can't partition it using the regular workflow because it is a polymorphic table with ties to p_ci_builds and ci_runners. A different problem that comes out of this is that we don't have foreign keys to remove the data from this table once a job or runner is deleted, so we have a lot of orphan rows in there.
This table is created by the acts-as-taggable-on gem to store the links between the tags and our tables.
Improvement steps
vendor in acts-as-taggable-on gem
keep only the features that we use from acts-as-taggable-on
create two separate tables for converting the polymorphic relations to regular relations. p_ci_builds_tags(partitioned) and ci_runners_tags.
modify the application to write to the new tables when records are inserted into taggings with a FF guard, similar to how metadatable works.
modify the application to read from the new tables with fallback on the taggings table when the data is empty
backfill the new tables with data from taggings
cleanup and drop the taggings table
Note: Ensure the new tables have appropriate sharding keys for cells
@mbobin@drew I pulled this into the Cells Support epic for now, but it could also go into a Partitioning epic. I think it is really needed for both efforts. WDYT?
@carolinesimpson@mbobin Agreed, given the size of the taggings table, should this be prioritized with the ongoing partitioning effort? Or scheduled as a follow-up once we partition the 6 CI tables?
@cheryl.li@carolinesimpson I can help as a dedicated Verify & Database reviewer to speed it up this effort, but I don't have any particular preference for which epic it should belong to. 😇
Thanks @mbobin. Based on this, we should probably schedule in the Next 4-6 releases (e.g. after the current CI Data Partitioning effort is complete) - WDYT @carolinesimpson@rutshah?
@mbobin I haven't looked at the internals of AATO in many years, but do you think there's some amount of configuration we could do to split them up into separate tables?
@drew looks like you could specify the taggings table name, but that's still a polymorphic table without FKs and a way to inject the partitioning constraint. While the table name is configurable, its structure is not and it still remains a global table.
Based on current progress, I don't think we'll get to partitioning taggings til we finish the current tables in flight, e.g. starting some time in FY25-Q3 (July). @mbobin@tianwenchen WDYT?
Any rough effort estimates for partitioning this table?
Sounds good to me. I guess my rough estimate is 3 milestones, one milestones for all the changes to prepare and update the gem and create our own tables, and one for executing the backfill considering the table size of taggings, and one for cleanup.
@mbobin Can you help me break down this issue into several issues, as it'll take 3ish milestones to fully partition the table (hence the 5 weight)? Feel free to promote it into its own epic. As discussed, I'd like to get started on this in %17.2.
@mbobingrouptenant scale noticed that this issue is closed, but it is still mentioned as sharding_key_issue_url for 2 tables in their database dictionary file here and here.
Sharding key issue URLs are supposed to remain open until the work on sharding keys on that table is complete. If the work on this table is complete, sharding_key_issue_url can be removed from the yml.
Could you please make sure that either:
An open issue is present in the yml for tags and taggings table?
Or, if the work is complete, the sharding_key_issue_url is removed from the yml?
Unfortunately not right now, as we are pattern matching to a issue URL's regex over here.
I would imagine that this has been done because different tables would warrant a different amount of work, and would finish at different points in time and it would be much easier to pull out each table into their own issue and track them seperately rather than one epic that encapsulates everything.
That said, I don't see a problem in tracking work of multiple tables in a single epic too (we already have multiple tables that point to the same issue URL, see the dashboard). It is just that you'd need to make changes in the specs too to accomodate an epic's regex as a valid sharding_key_issue_url.