Filtering by label with title deduplication
This is a follow-up to the great case study in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30741 by @smcgivern .
Picking up the Attempt C: store label titles for each object
attempt, how can we solve its disadvantages? Namely, moving labeled objects (issues, MRs) would result in (unacceptable) mass updates. Let's discuss the following approach:
Each Label
currently has a title
and an id
. Additionally, we'd create a deduplication table where we track unique title
values and assign a unique title_id
. Alternatively we can calculate a UUID-type-5 title_uuid
from the title
and don't need an additional table.
In order to label an issue, we'd store an array of title_id
s with the issue.
In order to search for issues with a label filter, we'd translate the query the same way and lookup issues by their array of title_id
s: issues.label_title_ids @> [10,20]
(inclusion). A GIN
index gives the ability to perform this search efficiently.
What about the original drawbacks?
We have to bulk update the objects when:
- The objects are moved from one project to another.
- The project is moved from one group to another.
No bulk update needed.
- The label is deleted.
Still problematic, but we can delay cleaning up the label_title_ids
until the next update to the issue (or the next relabeling) and re-check issues and their labels after performing the search (does the label actually exist?).
Can we defer the updates to a background job after label renaming?
- The label is renamed.
This still stands because the identifier we use (title_id
) is tied to the title
. It's inherent to denormalization when we allow changing the denormalized value.
Can we defer the updates to a background job after label renaming?
Advantages
- This approach is similar to
Attempt B: store label IDs for each object
but remedies the problem of searching by title (even across projects, groups). - It needs less storage compared to storing titles directly in denormalized fashion.
- Fast GIN index lookup
Drawbacks
- Need to maintain extra lookup table (solved by UUID approach)
- Bulk updates needed for label rename and deletion
Misc notes
We have about 350k distinct label titles on GitLab.com.