Comparing GitLab 13.5.0 against the latest nightly (13.9.0-pre 16bea7edbd9) shows that, under load, the Projects List API is now surprisingly consuming over 1 GB more memory:
13.5.0:
13.9.0-pre 16bea7edbd9:
The memory increase was first detected in a larger review of whether Ruby 2.7 led to increased memory usage, comparing 13.5.0 and 13.8.0-pre 852ea7c0283 (full results here), and further tests were run today to confirm whether that increase has persisted. Testing was done using our standard performance testing setup in lab conditions, with GPT running against a 10k Reference Architecture at 200 RPS.
As shown above, the endpoint today pushes each of the 3 GitLab Rails nodes to around 13.5 GB of RAM, whereas back on 13.5.0 each node only consumed around 12 GB. This is a notable increase of around 1.5 GB of Rails RAM usage per node, so we thought it worth raising an issue to investigate why this is the case. This is all the data we have to hand at the moment, but access to our test environments can be granted for further investigation as required.
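For context, the figures above come from node-level monitoring during the GPT run rather than from any in-process instrumentation. A minimal sketch of that kind of sampling (assuming a Linux node running Puma workers; the script below is purely illustrative and not the tooling we actually used) would be:

```ruby
require 'time'

# Illustrative only: sum the RSS of all Puma processes on a node by reading /proc.
# Assumes Linux, Puma-based Rails workers, and permission to read /proc entries.
def puma_rss_kib
  Dir.glob('/proc/[0-9]*').sum do |dir|
    cmdline = (File.read("#{dir}/cmdline").tr("\0", ' ') rescue '')
    next 0 unless cmdline.include?('puma')

    status = (File.read("#{dir}/status") rescue '')
    status[/^VmRSS:\s+(\d+)\s+kB/, 1].to_i
  end
end

# Sample total Puma RSS every 15 seconds while the load test is running.
loop do
  total_gib = puma_rss_kib / 1024.0 / 1024.0
  puts "#{Time.now.utc.iso8601} puma_rss_total_gib=#{total_gib.round(2)}"
  sleep 15
end
```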
Hey @mksionek. Looking at today's test results against 15.4.0-pre 569e2476b4f, higher memory usage does still appear to be occurring:
Obviously it's been a little while since 13.9, so we can't account for any other general memory increases, but today there's still an increase from around 12.4 GiB up to 14.4 GiB while the endpoint is being hit. After the test has completed, memory returns to similar levels.
Wanted to cross-post this response I gave to a similar question on the Single Issue Details endpoint - #321262 (comment 1083592124)
TL;DR Perhaps have a look to see if this endpoint is doing anything obviously expensive memory-wise? If not, there may be a bigger effort required here to review memory across the board.
@lohrc The root cause here is unknown and may take some time/effort to discover. I propose we first create an issue to investigate, and then hopefully spin out a second follow-up issue to solve whatever the problem turns out to be.
@alexpooley Could we just use this issue for the investigation, since it contains all the details? I have renamed the title to reflect that. We can then spin off follow-up issues based on the results. With that in mind, could you try to give this issue a weight just for the investigation part of the effort?
Christina Lohr changed title from Projects List API shows notably higher Rails memory usage under load to Investigate why Projects List API shows notably higher Rails memory usage under load
Since this issue was floated with ~"group::application performance" on Slack, I have some early thoughts on this:
This is most likely unrelated to the Ruby upgrade. If the runtime were responsible, we would have seen similar degradation in production, where we are much more likely to run into extreme corners of resource utilization. I also do not recall any changes to MRI that would explain this, plus, it appears to only happen for one endpoint. I highly suspect this is due to a change introduced in the application that occurred around the same time.
In order to investigate, we cannot just look at process RSS (I had left a similar comment elsewhere); this is a very coarse metric and can be misleading, as the factors behind RSS growth are far too many to consider. It merely tells us that there might be an issue, but will never tell us what causes it.
It would be useful to post here how these measurements were taken. @grantyoung If our team had an executable script to reproduce the conditions under which memory growth was observed, we could then drill into metric data or inspect the heap. It could, for example, be interesting to see whether and how per-thread allocations and GC stats changed between runs.
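A rough idea of what I mean, assuming we can wrap requests in a test deployment with a small piece of Rack middleware (the class name and log format below are made up for illustration, not something that exists in GitLab today):

```ruby
require 'logger'

# Hypothetical middleware that logs deltas of a few GC.stat counters per request.
# Per-thread allocation counters would need extra instrumentation; this only
# captures process-wide GC behaviour, but that is already far more actionable
# than raw RSS alone.
class GcStatsLogger
  KEYS = %i[total_allocated_objects heap_live_slots old_objects malloc_increase_bytes].freeze

  def initialize(app, logger: Logger.new($stdout))
    @app = app
    @logger = logger
  end

  def call(env)
    before = GC.stat.slice(*KEYS)
    response = @app.call(env)
    after = GC.stat.slice(*KEYS)

    deltas = KEYS.map { |key| "#{key}=#{after[key] - before[key]}" }.join(' ')
    @logger.info("gc_deltas path=#{env['PATH_INFO']} #{deltas}")

    response
  end
end
```

It could then be enabled in a test instance (e.g. `use GcStatsLogger` in a Rack config) and the logged deltas compared between two GitLab versions while GPT drives the Projects List API.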
We don't have any further insight available. This was raised following a review requested by the Memory team (now ~"group::application performance"), as a high-level indicator based on machine memory usage.
Since this was raised, the product has gone through multiple releases and teams and assignments have changed, so I don't have capacity to deep dive into this further. If we're not concerned by it (and no problems have been reported by customers), then maybe it's best to just close this and instead review memory usage more generally with the upcoming Ruby 3 upgrade to see if we're happy with it.
IIUC our team is trying to move away from this bottom-up sort of analysis without also understanding what the user impact is. For example, it is not clear to me in this case how the higher memory usage we started to see for this endpoint affects users. If it translates into higher latencies throughout (this could be due to knock-on effects such as an increase in memory kills due to OOMs), it could be cause for concern and a reason to investigate deeper. If not, then given the small size of our team there are likely more impactful things to work on.
I hope I am correctly reflecting what @rogerwoo and our team have been talking about lately; we are still in the process of figuring some of these things out, since Roger only joined us very recently and we also have a new EM joining (today, in fact!)