Currently merge_request_diff_commits and merge_request_diff_files are two of the largest tables on GitLab.com, weighing in at 288 GB and 740 GB respectively.
During the call, we discussed an alternative proposal: keeping the existing structure (for now) - ie full diffs, not deduplicated through blobs - but conditionally migrating the diffs to object storage, possibly using a git-lfs-like scheme in which the diff is replaced in the table with a pointer to an object storage location.
This approach would allow older diffs to be progressively moved over to object storage, allowing newer merge requests to continue to be stored in the database for performance reasons.
It would also make migrating GitLab.com's data much easier, possibly via a long-running background migration.
This approach does not preclude future enhancements (such as deduplication) but is a smaller first step which would relieve some of the pain this table is currently causing.
It's also worth pointing out that these diffs are immutable and pretty opaque as far as the database is concerned (ie, we're not indexing the content inside the diff), which seems to make it a fairly good candidate for object storage.
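To make the pointer idea concrete, here's a minimal sketch, loosely modelled on the git-lfs pointer format. All names here (`DiffPointer`, the serialized layout) are assumptions for illustration, not a proposed implementation:

```ruby
require 'digest'

# Hypothetical pointer record: instead of the raw diff text, the database
# column would hold a small serialized pointer like this, identifying the
# real content in object storage by content hash and size.
DiffPointer = Struct.new(:oid, :size, keyword_init: true) do
  # Serialize to a small, human-readable string for storage in the column,
  # similar in spirit to a git-lfs pointer file.
  def to_s
    "version 1\noid sha256:#{oid}\nsize #{size}\n"
  end

  # Build a pointer from the raw diff content being moved out of the row.
  def self.from_diff(diff_text)
    new(oid: Digest::SHA256.hexdigest(diff_text), size: diff_text.bytesize)
  end
end

pointer = DiffPointer.from_diff("@@ -1 +1 @@\n-foo\n+bar\n")
```

The content hash doubles as an integrity check when the object is read back, which is relevant to the corruption concerns raised later in this thread.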
Thanks @andrewn. I can see how this would work, it seems very sensible to me.
One question (although I suspect I should be the one who knows the answer, I do not!): do we typically address diffs one-at-a-time, or are there cases where we want to acquire a collection of diffs? If the latter, is there a scheme for efficiently pulling such a collection from the various object stores we'd support?
@nick.thomas @oswaldo let us know what you think. I was on the call just because I'd had a hand in a lot of this work, even though ~Create owns this area now.
@nick.thomas we always access these diffs in one of two modes:

1. A collection (on the first load of the merge request's diffs page).
2. A single file (when expanding a collapsed diff on the merge request's diffs page).
These can also be fetched from Gitaly if they don't exist in the DB; however, for the older MRs that are the concern here, if a diff isn't in the DB, there's a good chance it's not in the repository either.
> If the latter, is there a scheme for efficiently pulling such a collection from the various object stores we'd support?
It looks like we can use a prefix filter (at least on Google Cloud Storage). With that in mind, we could arrange for all merge_request_diff_files_raw_diffs (example name) to share the same prefix as their ancestor merge_request_diff. We may have other options for batch requesting, but that's one of them.
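For illustration, a sketch of the prefix idea. The key layout and helper names are assumptions: if every per-file raw diff is keyed under its parent merge_request_diff's id, a single prefix-filtered list request (which GCS, S3, etc. all support) returns the whole collection at once:

```ruby
# Hypothetical key scheme: all raw diffs belonging to one merge_request_diff
# share the prefix "merge_request_diffs/<diff_id>/", with relative_order as
# the final path component.
def raw_diff_key(diff_id, relative_order)
  "merge_request_diffs/#{diff_id}/#{relative_order}"
end

# Stand-in for an object store list call with a prefix filter, e.g.
# bucket.files(prefix: ...) in the google-cloud-storage gem.
def list_with_prefix(keys, prefix)
  keys.select { |k| k.start_with?(prefix) }
end

keys = [raw_diff_key(42, 0), raw_diff_key(42, 1), raw_diff_key(7, 0)]
collection = list_with_prefix(keys, "merge_request_diffs/42/")
# collection contains only the two diffs belonging to merge_request_diff 42
```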
I like the idea of this being some sort of optional feature with a scheduled task that moves over old diffs. It helps large instances scale, but is completely optional for smaller instances. Would we expect much of a performance improvement of GitLab.com once a significant portion of diffs are migrated?
My hunch right now is that we might not see a huge performance benefit, since retrieving from object storage may require more overhead and network hops. We'd also have to think about data integrity and how to deal with the case when object storage is offline or objects are corrupted.
I model this as opening more headroom for our current postgres database setup to scale into, rather than it being a huge performance win on its own terms. Making it optional, as with our other object storage features, makes sense to me.
I agree with what Stan and Nick said earlier. Considering that those two tables make up 50% of the overall database size today, it is beneficial to reduce their size. It certainly helps with managing the database in general (amount of WAL, shared buffers needed, backup sizes, ...), but also with managing these tables on their own (think data or schema migrations).
@abrandl thanks for adding your input! @stanhu, you seem concerned about the performance and operational implications of this approach. Would it be worth setting up a call and spending some time discussing this approach as well as alternatives?
@stanhu right, so you're not against the approach. I think we all agree then that this should be done in order to allow us to scale the Postgres instance on GitLab.com, rather than to improve performance.
@andrewn @Finotto I spoke with @nick.thomas and @DouweM about how we'd go about implementing this. The proposed approach would split development into two parts: (1) support for reading diffs from object storage, Omnibus changes to configure this, and some kind of script/rake task to manually migrate selected data to object storage for testing; (2) an automatic migration job to move old diffs out of the database to object storage.
In terms of scheduling, we won't be able to start on this until 11.8 because of reduced capacity over Christmas and the need to meet security SLAs. I've created https://gitlab.com/gitlab-org/gitlab-ce/issues/54670 for phase 2 and scheduled it for the following release.
@jramsay @DouweM I've started looking at the implementation for this. I'd like to re-use the same code as LFS objects, etc., for storing the diffs. This gets us, "for free", the ability to store external diffs on the filesystem instead of in object storage!
I say "for free", but that's not actually the case - for instance, it implies the necessity of adding Geo support, etc. So I think the right thing to do in the first stage is to re-use the code, but add a validation forbidding the on-filesystem case.
Awkwardness: merge_request_diff_files.id does not exist. We're also lacking other useful identifiers - there's no SHA, the filenames are difficult to work with, etc.
Naively, to reuse the existing code, we'd get a row in uploads, which needs a type + id. We could possibly work around that by specifying the merge_request_diff.id and putting the relative_order in the path somewhere. We could also skip having the row in uploads entirely, which I think would be very worthwhile.
We don't need the row in uploads, and indeed, LfsObject doesn't generate one.
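A sketch of the workaround described above. The path layout, sharding scheme, and helper name are all assumptions: since merge_request_diff_files has no id column, the object location can be derived entirely from the parent merge_request_diff.id plus the file's relative_order, with no uploads row needed:

```ruby
# Hypothetical path scheme for external diffs: the location is fully
# determined by the parent diff's id and the file's relative_order, so
# no uploads row (and no per-file id) is required to find the object.
def external_diff_path(merge_request_diff_id, relative_order)
  # Shard by the last two digits of the id to avoid one huge flat
  # "directory", mirroring how other GitLab storages partition paths.
  shard = format('%02d', merge_request_diff_id % 100)
  "external-diffs/#{shard}/#{merge_request_diff_id}/#{relative_order}.patch"
end
```

Because the path is a pure function of values already in merge_request_diffs, the reading side can reconstruct it on demand, which is what makes skipping the uploads row viable.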