Track Markdown references to make rendering Markdown more efficient
Currently we always parse issue bodies, comments, etc. because we might need to redact data. As of !9462 (merged) we also have to check issue links to adjust their state in the Markdown. This means we're always parsing HTML, operating on it, then serializing it back to a String. In many cases we're doing nothing useful at all, as private references are rare. As such, we're wasting a lot of time when displaying data.
To work around this I propose a system where we keep track of references from source objects (issues, comments, or merge requests) to target objects (issues or merge requests). In this system, whenever a comment refers to an issue (regardless of its state or visibility) we store a reference in a table. When the target object is updated (e.g. the visibility is changed), we flush the cached Markdown of all referring objects (using the references table).
Let's say we start out with tracking references for issues, comments, and merge requests. To ensure this data can be vacuumed and updated efficiently, I would store it in separate tables, one per source type. Thus, we end up with these tables:
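A rough sketch of what these tables could look like. Only `note_references` and `issue_references` are referenced in the queries below; the exact columns, and a matching `merge_request_references` table, are assumptions:

```sql
-- One references table per source type; each row links a source object to a
-- target object mentioned in its Markdown.
CREATE TABLE note_references (
  note_id integer NOT NULL REFERENCES notes (id) ON DELETE CASCADE,
  target_issue_id integer REFERENCES issues (id) ON DELETE CASCADE,
  target_merge_request_id integer REFERENCES merge_requests (id) ON DELETE CASCADE
);

-- issue_references and merge_request_references would follow the same pattern,
-- with issue_id / merge_request_id as the source column.
```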
Each column would have a foreign key with a cascading delete, ensuring references are removed automatically whenever a source or target object is removed. The target ID columns in these tables need to be covered by a regular index.
An index on the source column (`issue_id` for example) might also be needed, but this depends a bit on how we write the queries used for flushing data. For example, if we use

```sql
WHERE notes.id IN (SELECT note_id FROM note_references ...)
```

we probably don't need an index, but I'm not 100% sure.
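For illustration, the indexes discussed above might look like this (the names are made up):

```sql
-- The target ID columns always need an index so flush queries can quickly
-- find the rows referring to a given issue or merge request.
CREATE INDEX index_note_references_on_target_issue_id
  ON note_references (target_issue_id);
CREATE INDEX index_note_references_on_target_merge_request_id
  ON note_references (target_merge_request_id);

-- Only needed if the flush queries end up filtering on the source column:
-- CREATE INDEX index_note_references_on_note_id ON note_references (note_id);
```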
In this system, if you update an issue you run the following SQL queries to flush any caches of referring objects:
```sql
UPDATE notes
SET note_html = NULL
WHERE EXISTS (
  SELECT true
  FROM note_references
  WHERE target_issue_id = CURRENT-ISSUE-ID
  AND notes.id = note_references.note_id
);

UPDATE issues
SET title_html = NULL, body_html = NULL
WHERE EXISTS (
  SELECT true
  FROM issue_references
  WHERE target_issue_id = CURRENT-ISSUE-ID
  AND issues.id = issue_references.issue_id
);

-- etc
```
This can be done using a Sidekiq job so the flushing doesn't impact response times. Each "source" collection (issues, merge requests, etc.) would be a separate job, allowing data to be flushed in parallel.
With a setup such as the above, once Markdown has been rendered all we need to do is:
- Check if there are any private references (we can track this in a separate boolean column; see the sketch below)
- If so, redact these references if necessary. If not, just display the HTML as-is (without parsing it at all).
This in turn means that for the vast majority of resources (or at least comments) we don't need to do anything after the initial rendering phase.
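The "private references" flag mentioned above could be a plain boolean column on the source tables. A minimal sketch, using a made-up column name:

```sql
-- Set to true during rendering whenever the body contains a reference the
-- viewer might not be allowed to see; checked before deciding to re-parse.
ALTER TABLE notes
  ADD COLUMN contains_private_references boolean NOT NULL DEFAULT false;
```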
Why not use polymorphic associations?
There are a few reasons for this:
- We end up storing more data in a single table, meaning vacuuming will have to process more (thus taking more time).
- Querying referring objects becomes more painful.
- We cannot use foreign keys to enforce consistency and automatically remove referring objects when target objects are removed.
- Because all data is in a single table, it all resides on the same disk. Using separate tables allows you to store tables on separate disks, which can be very useful for large tables (e.g. our …).
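To illustrate that last point: PostgreSQL can move an individual table to another disk via tablespaces, which only works at this level of granularity when the data lives in its own table (the path and names here are made up):

```sql
-- Create a tablespace on a separate disk and move the references table there.
CREATE TABLESPACE references_disk LOCATION '/mnt/disk2/pg_references';
ALTER TABLE note_references SET TABLESPACE references_disk;
```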
For vacuuming: consider having 1 million references. Storing all of them in a single table means having to process 1 million tuples every time we vacuum. Storing this data in separate tables means we have `1 million / n-tables` tuples per table. Since different tables can be vacuumed in parallel, this reduces the total time spent vacuuming.
For querying: consider the following query to flush the cache of referring notes:
```sql
UPDATE notes
SET note_html = NULL
WHERE EXISTS (
  SELECT true
  FROM note_references
  WHERE target_issue_id = CURRENT-ISSUE-ID
  AND notes.id = note_references.note_id
);
```
Now compare this with the polymorphic equivalent:
```sql
UPDATE notes
SET note_html = NULL
WHERE EXISTS (
  SELECT true
  FROM note_references
  WHERE target_id = CURRENT-ISSUE-ID
  AND target_type = 'Issue'
  AND notes.id = note_references.note_id
);
```
There are a few problems with this:
- In our current setup we store class names for target types, meaning we can basically never rename them without writing a large migration.
- Querying requires an additional condition on the `target_type` column in every query.
- The `target_type` column will probably require another index.
- Foreign keys don't work with polymorphic associations, meaning we need to take care of removing referring objects ourselves (something PostgreSQL is much better at).
For foreign keys: there is no database that I know of that supports polymorphic foreign keys. This means we need to manually remove referring objects (e.g. using the `delete_all` method) when removing source objects. This has a tendency to be very slow (e.g. it's currently almost impossible to remove a project with a lot of events because of this), whereas foreign keys can handle this very efficiently. Foreign keys also ensure references won't end up being orphaned (a problem we have run into a lot in the past).
- For every reference of every referring object we insert 1 row. This means these tables can grow quite large (think millions and millions of rows).
- Migrating these tables (because of the above) will be a time-consuming process. This means changes need to be made only when absolutely necessary.
- Flushing of caches has to be done using Sidekiq, as otherwise this could delay web response times. This means the cache invalidation is eventually consistent. This, however, should be fine since caches are not critical data.
- We have to make sure that per object we only run a single INSERT for all references, and not one INSERT per reference (see the sketch below).
- When a reference is removed from the source text, we need to make sure the corresponding reference rows are removed as well.
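For the last two points, a rough sketch of how the rows for a single note could be maintained, using the column names assumed earlier and made-up IDs:

```sql
-- Re-rendering note 123: drop its old reference rows, then insert the new
-- ones with a single multi-row INSERT rather than one INSERT per reference.
DELETE FROM note_references WHERE note_id = 123;

INSERT INTO note_references (note_id, target_issue_id)
VALUES
  (123, 456),
  (123, 789);
```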