Due to ongoing load issues on the primary database, we need to take action to reduce the load. One of the solutions we're considering is partitioning tables larger than 100 GB. As a very large table, merge_request_diff_commits has been selected as a candidate. For more information on available strategies, please review our docs.
We'd like to answer the following questions:
Does merge_request_diff_commits, merge_request_diff_files, merge_request_diffs, or merge_requests have keys that would be easy to partition on?
What strategy seems like the best bet for partitioning this table?
Are there other strategies we'd like to consider to reduce the size of merge_request_diff_commits?
Availability and Testing
For regression testing, please make sure the e2e:package-and-test job is passing.
If you are unsure about the correct group, please do not leave the issue without a group label; refer to GitLab's shared responsibility functionality guidelines for more information on how to triage this kind of issue.
This issue was automatically tagged with the label ~"group::authentication and authorization" by TanukiStan, a machine learning classification model, with a probability of 0.43.
If this label is incorrect, please tag this issue with the correct group label as well as ~"automation:ml wrong" to help TanukiStan learn from its mistakes.
@mnohr - I've assigned this to your team based on the feature category in the related database dictionary file. Due to the unprecedented load on the primary database, we're asking folks to make these assessments as quickly as possible so we can identify good candidates for partitioning. Thanks for your cooperation!
CC @kerrizor since we commented back and forth about this earlier.
I think using merge_request_diff_id as the partition key for merge_request_diff_commits and merge_request_id for merge_request_diffs makes the most sense if we were to partition them. I'm not sure if there are any edge cases where we wouldn't query by those keys, though.
For merge_request_diff_commits and merge_request_diff_files, we have merge_request_diff_id and relative_order as the primary key. Will that be an issue with the existing partitioning strategies?
I don't think we can use range partitioning for these tables because of how they're being queried (we may end up with 1 table per merge_request_diff_id which I don't think is ideal). So we'll need to go with list partitioning in my opinion.
How much easier would life be if we had merge_request_id on these records as well?
@kerrizor Not sure if it'll be that useful since merge_request_diff_commits and merge_request_diff_files are associated with a record in merge_request_diffs. We don't really access these records directly from a merge request as far as I know.
@patrickbajao Hmm.. well, to expand on it, I was thinking specifically about range partitioning; if these records had merge_request_id (since that's the context in which they're grouped and interacted with) we'd end up with fairly well distributed partitions, I think?
@patrickbajao I was thinking of a partition on ranges of merge_request_ids.. but I see in the docs we're only set up for date range partitioning, which doesn't work for us. Are we able to partition on ranges of integers?
Perhaps we need to go even higher, to project or group, since all merge_request_diff_* accesses happen within the context of a single one of those. That's still a lot of partition tables.. what's our horizontal maximum?
Barring either of those paths.. what's left? Manual list partitioning or hash? Either way, is my understanding correct that we want associated data that appears within the same context to exist on the same partition?
I finally had a chance to think a bit more about this, and I'm wondering why we couldn't use a range strategy with merge_request_diff_id as the partition key.
I know we only mention date ranges in our docs, but it seems that integer values can also be used according to the Postgres docs. There might be a reason we don't mention that in our docs, but I don't see why this would be any different from using date ranges.
I'm thinking something along the lines of partitioning it as:
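(Sketch only; the column list is trimmed and the range bounds are placeholders.)

```sql
-- Illustrative sketch: range partition merge_request_diff_commits by
-- merge_request_diff_id, each partition covering a fixed range of IDs.
CREATE TABLE merge_request_diff_commits_partitioned (
  merge_request_diff_id bigint NOT NULL,
  relative_order integer NOT NULL,
  -- ...remaining columns elided...
  PRIMARY KEY (merge_request_diff_id, relative_order)
) PARTITION BY RANGE (merge_request_diff_id);

CREATE TABLE merge_request_diff_commits_1 PARTITION OF merge_request_diff_commits_partitioned
  FOR VALUES FROM (1) TO (10001);
CREATE TABLE merge_request_diff_commits_2 PARTITION OF merge_request_diff_commits_partitioned
  FOR VALUES FROM (10001) TO (20001);
```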
Granted, 10000 is an arbitrary number for illustration, but I think this is the best way to partition this table since we mostly query it by merge_request_diff_id. I think we could calculate the average data size and come up with a target range size if this is the right approach.
I suppose we could try hashing merge_request_diff_id with a fixed number of tables as well, but I think that fixed number would eventually need to grow too unless we choose a really big number. That brings me to the question of how the underlying partition tables are generated. Are they created as we need them, or do they need to be created manually ahead of time? I'm assuming Postgres creates them automatically, but I might be wrong; I haven't found a clear answer to that yet.
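For comparison, the hash variant would look something like this (again just a sketch; the modulus of 8 is arbitrary):

```sql
-- Illustrative sketch of hash partitioning with a fixed number of partitions.
CREATE TABLE merge_request_diff_commits_hashed (
  merge_request_diff_id bigint NOT NULL,
  relative_order integer NOT NULL,
  PRIMARY KEY (merge_request_diff_id, relative_order)
) PARTITION BY HASH (merge_request_diff_id);

-- One CREATE TABLE per remainder, 0 through 7.
CREATE TABLE merge_request_diff_commits_hashed_0 PARTITION OF merge_request_diff_commits_hashed
  FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE merge_request_diff_commits_hashed_1 PARTITION OF merge_request_diff_commits_hashed
  FOR VALUES WITH (MODULUS 8, REMAINDER 1);
```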
@stomlinson Would you be able to let us know your thoughts on what I said above?
Also, what would be the downside of creating too many partitions? We're doing this to improve database performance, but at what point do we say it's doing more harm than good? Can we have some guidance around that as well?
@dskim_gitlab I think this is a great idea! Range partitioning by an integer is totally reasonable.
A similar idea is the sliding list partitioning strategy. It adds a default column and bumps the default value as a new partition is created, so that inserts end up in the new partition. It has the advantage that we can very easily control the size of partitions (by creating a new one when the previous reaches a certain size), and that the system "fails safe": if partition management stops happening, inserts simply keep going to the current partition.
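Roughly, the mechanics look like this (sketch only; table and column names are illustrative):

```sql
-- Illustrative sketch of the sliding list idea: a partition column with a
-- default value, list partitions keyed on that value, and the default bumped
-- whenever a new partition is created so fresh inserts land there.
CREATE TABLE example_records (
  id bigserial,
  partition_id bigint NOT NULL DEFAULT 1,
  payload text,
  PRIMARY KEY (partition_id, id)
) PARTITION BY LIST (partition_id);

CREATE TABLE example_records_1 PARTITION OF example_records FOR VALUES IN (1);

-- When partition 1 reaches the target size, create the next one and move the
-- default so new rows go there; if this step ever stops happening, inserts
-- simply keep going to partition 1.
CREATE TABLE example_records_2 PARTITION OF example_records FOR VALUES IN (2);
ALTER TABLE example_records ALTER COLUMN partition_id SET DEFAULT 2;
```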
I think either would be a good fit for your case.
Also, what would be the downside of creating too many partitions? We're doing this to improve database performance, but at what point do we say it's doing more harm than good? Can we have some guidance around that as well?
This is highly dependent on the number of partitions that are accessed by a single query. If Postgres is able to statically determine that it only needs to access a few partitions at a time, on the order of hundreds of partitions is acceptable. If every partition needs to be checked by a query, fewer partitions will perform better. Here, with range partitioning, I think we'll want a lot of partitions.
With merge_request_diff_commits at over 3 TB (2.4 TB of data, the rest indexes) and an average row size of about 183 bytes, we're going to want a lot of data in each partition. In general I'd recommend partition sizes around 10 GB, so roughly 330 partitions to start. That's starting to approach a number of partitions I'd be concerned about, so possibly we'd want to bump to 20 GB of data in each, cutting the number of partitions in half.
If the table size that the Database group is recommending is 100 GB, is there any reason why the partition sizes cannot be 100 GB? Is there a technical reason for them needing to be smaller?
I'd prefer to lean significantly smaller than 100 GB per table if the implementation strategy is the same as it would be at 100 GB per table.
The database group is recommending 100 GB as an upper limit for table sizes as a compromise between database scalability and ease of development. Once we've partitioned, if there's not an ease-of-development reason for large partition sizes, I recommend a size closer to 10 GB. At a smaller table size, Postgres's vacuum behavior is much more resilient to various access patterns and will behave better during heavy workloads or database degradation.
@stomlinson Thanks a lot for the clarification around the target size for each partition!
A similar idea is the sliding list partitioning strategy. It adds a default column and bumps the default value as a new partition is created, so that inserts end up in the new partition. It has the advantage that we can very easily control the size of partitions (by creating a new one when the previous reaches a certain size), and that the system "fails safe": if partition management stops happening, inserts simply keep going to the current partition.
That's an interesting approach. Does this mean we would manually create a database migration occasionally while monitoring the size of the latest partition? I think that could potentially work for gitlab.com, but I'm not sure how we can apply that to self-managed instances since they would have different sizes. How would we handle future partitions? Do you have an example I can take a look at?
Also, are there any helpers we can use to implement the range strategy with integer keys? I didn't find any, so I suppose we'd need to implement them ourselves?
I took a look at the date range strategy as an example to understand what we should do. The date range strategy seems to create partitions up until the month the partitioning migration gets run, by the looks of it, but how are we generating partitions for future months? Are we supposed to generate them manually into the future, or is there an automated process I missed?
In addition, merge_request_diff_commits and merge_request_diff_files have merge_request_diff_id and relative_order as their primary key. Would that be a problem if we partitioned just by merge_request_diff_id, or should the partition key be the same as the primary key? I feel it wouldn't matter for the purpose of partitioning, but it would be great if you could confirm.
@dskim_gitlab we have an automated process that handles both future date ranges and the sliding list strategy: the PartitionManager.
It runs every hour + every app boot (which ends up meaning it runs thousands of times a day on gitlab.com), and creates new partitions that are needed for these dynamic processes. It can also bump the current partition in the sliding list strategy case.
The web_hook_logs table uses range partitioning, defined here and backed by the MonthlyStrategy.
You can see a new partition being created for web_hook_logs at the beginning of the month here (it happens twice, once for the main database and once for the ci database, and we precreate partitions 6 months out).
This is specific to dates right now (it assumes the partition definitions are time partitions), but that could be changed here.
The loose_foreign_keys_deleted_records table, on the other hand, uses the sliding list strategy here.
You can see that a proc can be defined for when the next partition should be created. We can just skip the detach_partition_if proc if we never want to drop old partitions.
security_findings does the same thing and uses table sizes to move to the next partition; you can see it here.
Because you can precreate many future partitions, I think range partitioning would work well for this case and have less complexity than the sliding list partitioning strategy.
Also, are there any helpers we can use to implement the range strategy with integer keys? I didn't find any, so I suppose we'd need to implement them ourselves?
Unfortunately we don't have a helper for range partitioning with integer keys, but I think the date partitioning helpers could be pretty easily adapted to support integer keys as well as dates.
Would that be a problem if we partitioned just by merge_request_diff_id, or should the partition key be the same as the primary key?
It's completely normal to only partition by one column of the primary key, and will work well in your case!
I kind of like the idea of the sliding list strategy since we could control the partitioning based on the size of the active partition, and it would continue to work even if PartitionManager malfunctions. However, for that to work, I believe we would need to store the partition_number in the parent object in order to make a query that performs well, right? I think that could become quite cumbersome since we have multiple hierarchies for these tables, and it would also increase the total DB size.
I'm leaning towards the integer range strategy, set up so that we create one full partition ahead of the current active partition to give PartitionManager ample time to catch up if there are any unexpected issues. According to my calculation, setting 100 million rows as the partition size would yield roughly 20 GB per partition (at the ~183 bytes per row mentioned above, 100 million rows is about 18 GB of data).
I think we could use this approach for any merge_request tables with different row limits depending on the average row size.
@code-review-be Does anyone see any issues with this approach? It'd be great to get some feedback on this in case I missed anything, so that we can consider it before we invest too much time in implementing this.
@mnohr Since this is an investigative issue, I'll mark it as ~"workflow::in review" since I've proposed possible steps we could take above and am waiting for feedback.
We also have plans to drop older data (see this epic). How would partitioning impact that? If we are partitioning based on id, which should be sequential, and then we delete all the older data, would we be left with empty partitions?
@dskim_gitlab the only issue that came up when I proposed this partitioning strategy previously was that a single MR might have merge request diffs in different partitions. Is that actually a problem, @stomlinson ?
We also have plans to drop older data (see this epic). How would partitioning impact that? If we are partitioning based on id, which should be sequential, and then we delete all the older data, would we be left with empty partitions?
@mnohr I think we have a mechanism to delete old tables in PartitionManager, so I believe we should be able to adapt it to delete empty partitions if needed. I guess it depends on how we are planning to delete them, but handling tiny partitions left over from the deletes is also something we should consider. Maybe we could merge them if/when that happens?
the only issue that came up when I proposed this partitioning strategy previously was that a single MR might have merge request diffs in different partitions.
@kerrizor Given the size of each partition I've suggested above, I'd think we wouldn't have too many partitions involved per MR. Do you have any specific query/UI where this could be an issue?
Anyways, I'd be keen to hear what @stomlinson thinks about this.
@dskim_gitlab if crossing partitions is an issue, I was thinking we could de-normalize the merge_request_id onto that model, so we would only ever be searching within a single partition. (If that's a problem.. I dunno, but @stomlinson and I are meeting later this week, I'll bring it up.)
@kerrizor Yes, I suppose that could be an option if we found that to be an issue, but I'd like to avoid de-normalizing if we can help it.
Have you noticed any part of our app that would query all merge_request_diff_commits for a given merge_request at once? At least merge_request_diff_commits always seems to be queried by merge_request_diff_id and relative_order, so it wouldn't be an issue as far as I can tell. However, I may have missed something, so please let me know if you have anything specific in mind.
@dskim_gitlab Here are answers to your various questions in this thread. Let me know if I missed any!
However, for that to work, I believe we would need to store the partition_number in the parent object in order to make a query that performs well, right? I think that could become quite cumbersome since we have multiple hierarchies for these tables, and it would also increase the total DB size.
Yes, you would need to copy the partition_number up the hierarchy. For this reason I think the range partitioning might work better for you.
I'm leaning towards the integer range strategy, set up so that we create one full partition ahead of the current active partition to give PartitionManager ample time to catch up if there are any unexpected issues. According to my calculation, setting 100 million rows as the partition size would yield roughly 20 GB per partition.
This sounds great to me!
@dskim_gitlab the only issue that came up when I proposed this partitioning strategy previously was that a single MR might have merge request diffs in different partitions. Is that actually a problem, @stomlinson ?
I don't think this should be a problem, but we should test it to be absolutely sure.
We also have plans to drop older data (see this epic). How would partitioning impact that? If we are partitioning based on id, which should be sequential, and then we delete all the older data, would we be left with empty partitions?
It would be pretty tricky to merge partitions, so I think we should pursue removing old data before partitioning.
@kerrizor discusses it in #398213 (comment 1388575477) and I agree - we might not even have to partition once we remove old data, so I suggest pursuing the deletion approach first.
I agree we should pursue the deletion approach first and consider this again if it's still an issue after that.
@mnohr Should we close this now that we have a better understanding of what we should do? I wonder whether it makes sense to create an issue for partitioning, given that we may not go down that path if we delete enough records.
Hey @kerrizor @dskim_gitlab @patrickbajao @mnohr Thanks for the great discussion so far - I definitely don't want to move the end goals here, but I want to make sure that I'm sharing the proper context.
TL;DR: We need to get all of our tables under 100GB.
What is the reason for this goal? As a result of our continued growth on gitlab.com, we're seeing unprecedented primary CPU saturation. By reducing table sizes, through partitioning or other means, we reduce some of the basic overhead that Postgres needs in order to operate the database. Operations like vacuuming, and even simple writes, are more complicated when performed on large tables. Reads also get easier since smaller tables have less data to sift through.
Why 100GB? It's a large round number. Ideally, we'd actually like our tables to be even smaller than that. We'd prefer tables that are less than 10GB in size, but we decided on 100GB since it limits the number of tables involved in the effort and allows us to act on the largest tables.
So, what's the ask here? All of the issues we filed around this have seen some substantial discussion. If you haven't been able to determine a strategy for partitioning (or even if you have), we'd like you to consider any and all other ways we'd be able to reduce the table sizes.
Reduce data retention - deleting old data may be viable, though we'll have to be careful since this can actually cause its own set of issues when deleting a lot of records
Break up tables using single table inheritance into separate tables. This is akin to partitioning in a way, but would allow us to further partition in the future if needed.
Optimize indexes - Does your table have large indexes that are redundant or might not be used? Consider dropping them!
Optimize data types - Does your table store something as text that could be an enum? Converting it to an integer could result in substantial space savings (see the sketch after this list).
Normalization - Review relational modeling and apply normalization techniques to remove duplicate data
Vertical table splitting - If your table has some large columns, consider splitting them out into their own tables instead.
Externalize - Is this data we could store somewhere else, like object storage? If it's a large blob of data, that may make the most sense.
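To make the data-type option concrete, here's a rough sketch of the kind of change it implies (the table, column, and values are made up for illustration, and a real migration would batch the backfill):

```sql
-- Hypothetical example: replace a text status column with a smallint enum.
-- 'example_items' is not a real table; values are for illustration only.
ALTER TABLE example_items ADD COLUMN status_enum smallint;

UPDATE example_items
SET status_enum = CASE status
                    WHEN 'opened' THEN 1
                    WHEN 'closed' THEN 2
                    WHEN 'merged' THEN 3
                  END;

ALTER TABLE example_items DROP COLUMN status;
ALTER TABLE example_items RENAME COLUMN status_enum TO status;
```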
If you discovered a way we may be able to partition the table, please consider whether any of the options above would be better than partitioning. If you didn't, please consider whether any of the above would be viable options instead. Once done, please assemble a set of issues outlining your plan to reduce the table size and link it here; then feel free to close this issue!
What we're looking for is a plan to break these tables down so that we can try to get our primary database under control as we continue to grow! Thanks so much for your help so far and into the future!
@alexives thanks for sharing more information! I have some questions for clarification...
deleting old data may be viable, though we'll have to be careful since this can actually cause its own set of issues when deleting a lot of records
Aside from the possibility of breaking some features (e.g. old pages will start to return 404 for non-existing records), will there be a performance implication if we delete a lot of records in batches? Will it affect VACUUM performance since it might trigger autovacuum more often?
Break up tables using single table inheritance into separate tables. This is akin to partitioning in a way, but would allow us to further partition in the future if needed.
It's not clear to me how this can work since STI works on the same table. Are we talking about multiple table inheritance here?
will there be a performance implication if we delete a lot of records in batches?
Yes @patrickbajao, definitely. If we decide to do this we'll have to be very mindful about how we delete the data and it may not be very fast. The resulting dead tuple ratio could cause serious issues with those tables.
It's not clear to me how this can work since STI works on the same table
Single table inheritance is where multiple models point at the same database table. Instead, we suggest that tables using STI are broken into a table for each associated model. They can use shared model code in base classes, but each would point at its own table.
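As a made-up illustration of the split (none of these are real GitLab tables):

```sql
-- Before: a single STI table with a `type` column shared by several models.
CREATE TABLE vehicles (
  id bigserial PRIMARY KEY,
  type text NOT NULL,        -- e.g. 'Car' or 'Truck' in Rails STI
  wheel_count integer
);

-- After: one table per model, same columns, no `type` column needed.
CREATE TABLE cars (
  id bigserial PRIMARY KEY,
  wheel_count integer
);

CREATE TABLE trucks (
  id bigserial PRIMARY KEY,
  wheel_count integer
);
```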
In some ways, partitioning could be a sort of multiple table inheritance, since it's many tables with the same columns pointing at the same model, as opposed to a single table with the same columns pointing at multiple models.
Alex Ives changed title from Consider partitioning strategies for merge_request_diff_commits to Consider partitioning strategies for merge_request tables
We're a little stymied on this, honestly, because of this discussion; the overlap between the two issues diverted our focus a bit, and now with the LLM push I haven't revisited it to do a round-up. I believe we can begin to make progress on this in %16.0; beyond just scheduling, the biggest hurdle is answering the question:
(Sub-question: how much of a reduction do we expect that to produce?)
..because if we can answer that, we can also give Gitaly guidance on the expected increase in access requests, since forcing stale (older than 6 months) data through a one-time regeneration step requires a Gitaly request; I believe the impact would be virtually undetectable, but I'd really like some hard numbers before we start deleting records. Given the discussion on this (&8582 (comment 1324492493)) I don't think it's an issue, so we really just need to schedule Next Steps.
However, that part of the discussion is really about understanding the access patterns and lifecycle of an MR. We could pretty "simply" answer the question of the size of the reduction by quantifying the size of the data associated with MRs that were closed or merged more than 6 months ago; whatever's left is the new floor of our data size. I've been told that this is impossible to gather from the database given its current size; is that true, or is this a number we can generate? If so, I would love to have it, and would really love to work with someone to figure that number out. It seems like a fairly straightforward data query...
What other options do we want to pursue, and are there issues/epics for those efforts yet?
Other than some vague conversations about the specifics of partitioning, isn't the only answer for reducing the size of these tables "stop storing so much data"? We've honestly had more divergent ideas on how to partition the data, but nothing that has gotten enough response to warrant opening a new line of inquiry, such as migrating the stale data into new archive_* tables that are designed to be well-partitioned, or adding data to the merge_request_diff_* tables that would make them partitionable.
If we could only choose one large data project to take on this quarter, from your database-focused and -aware perspective, which would be the higher priority? Improving our load issues, or reducing our data footprint?
However, that part of the discussion is really about understanding the access patterns and lifecycle of an MR. We could pretty "simply" answer the question of the size of the reduction by quantifying the size of the data associated with MRs that were closed or merged more than 6 months ago; whatever's left is the new floor of our data size. I've been told that this is impossible to gather from the database given its current size; is that true, or is this a number we can generate? If so, I would love to have it, and would really love to work with someone to figure that number out. It seems like a fairly straightforward data query...
This would take a few hours or more, but I see no reason why we couldn't run the query against database lab or the delayed replica! I can help you run this query if you have one in mind; I'm not particularly familiar with the table / column structure involved.
If we could only choose one large data project to take on this quarter, from your database-focused and -aware perspective, which would be the higher priority? Improving our load issues, or reducing our data footprint?
This is a good question, and a difficult one to answer!
Reducing data footprint will improve vacuum performance, which can lead to improvements to cpu (less time spent vacuuming), and also improvements to query time (less dead data to be processed during queries, both because it was deleted and because vacuum keeps up better).
However, it's possible that there are queries that are consuming a lot of CPU time, and that could be quicker wins.
I opened #408517 (closed) today to get better details about how much CPU time each query is taking. I'll have a better answer for this once we have that information in a few days, and I think at that time we can answer your question a lot better.
I can help you run this query if you have one in mind; I'm not particularly familiar with the table / column structure involved.
@stomlinson I appreciate the assist! I think we're somewhat safe in assuming that MR "size" is effectively constant, so to answer the question re: how much data we'd be deleting, I think what gets us there is generating a row count for both merge_request_diff_commits and merge_request_diff_files for merge_requests that were merged or closed in the last 6 months; this would show us the data we would be retaining.
Both of these tables have merge_request_diff_id as a FK, which in turn has merge_request_id.
MRs are merged or closed based on their state_id -- 2 is closed, 4 is merged. The actual timestamps for these states are on merge_request_metrics as last_closed_at and merged_at respectively.
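Something roughly like this, I think (untested sketch, using the join path and state values described above; the same shape would apply to merge_request_diff_files):

```sql
-- Rough, untested sketch: count retained merge_request_diff_commits rows for
-- MRs merged or closed within the last 6 months.
SELECT COUNT(*)
FROM merge_request_diff_commits mrdc
JOIN merge_request_diffs mrd   ON mrd.id = mrdc.merge_request_diff_id
JOIN merge_requests mr         ON mr.id = mrd.merge_request_id
JOIN merge_request_metrics mrm ON mrm.merge_request_id = mr.id
WHERE (mr.state_id = 4 AND mrm.merged_at >= now() - interval '6 months')
   OR (mr.state_id = 2 AND mrm.last_closed_at >= now() - interval '6 months');
```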
Clear as mud? Let me know, I'm happy to schedule some sync time to chat more about this. Getting the number of rows should allow us to do some basic math to get a "good enough" answer to our space savings question.
@kerrizor I tried running these today; I got a late start and had to cancel them after 3 hours without a result. When I'm back next week I'll think about how we can break the queries up so that we can run them over chunks of the table and get results incrementally.
@kerrizor I split this to an incremental process and am running it against database lab. A rough estimate is that it'll take 24 hours of runtime to answer each of the two queries, so I'll have answers in a few days.
@kerrizor that script took so long to run and was only halfway done, so I tried an alternate approach based on Postgres table statistics and on when the updated_at column changes for a merge request.
Take a look in #396795 (comment 1402461003) (internal note because it contains data about row counts and how often MRs are updated).
@mnohr Is there actually a deliverable here? The issue suggests we consider partitioning strategies... I think we've considered them, but maybe it's not actionable at this time? Maybe that's pending the answer to @kerrizor's question for @alexives:
If we could only choose one large data project to take on this quarter, from your database-focused and -aware perspective, which would be the higher priority? Improving our load issues, or reducing our data footprint?
@phikai As of now, I am thinking we are going to need to do a combination of things to really have an impact on the size (and therefore performance) of our very large tables. This should include continuing the investigation into removing older data, but then also continuing the investigation here on what partitioning would look like.
I think the outcome of this issue is just to answer the questions in the description, mainly:
With that information, we can help make a decision on if partitioning really makes sense. I'm hoping @dskim_gitlab can take a look and answer those questions.
Thanks for the collaboration from many people on this issue. I am going to close this issue for now with the following findings:
We have an idea of how we would partition the tables when we need to. See this thread that talks about a sliding strategy based on IDs / size.
We think it is more important to try to delete older/unused data before we start partitioning. See Trim merge_request_diff_* records for outdated ... (&8582 - closed) for more details on that effort. It would be difficult to merge partitions (see this comment) if we were to first partition the tables and then delete data.
@phikai Tagging you so we can consider this when we prioritize the work for the next milestone. I think trimming old records was already high on the list of priorities.