Improve Maven response times for 404s
Endpoint HEAD /api/:version/groups/:id/-/packages/maven/*path/:file_name
🍩 Context
The maven API file endpoints are hammered with requests.
Upon inspecting the response status for the last 24 hours, we get this:
So it's safe to say that more than 75% of the traffic on those file endpoints doesn't get an actual file.
Why is it so?
This comes from how `mvn` (the main client) works. When you set up multiple registries for your dependencies and pull them, `mvn` loops over each registry and asks: knock, knock, do you have package `X`?
In other words, the GitLab Maven package registry receives many requests for packages/files that are not hosted in GitLab but elsewhere. My guess is that the majority of them are in Maven Central but, due to how `mvn` works, GitLab gets "pinged" too.
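To make the "knock, knock" behaviour concrete, here is a minimal sketch of that lookup loop. It is a simplified model and not `mvn`'s actual resolution code: the `Registry` struct, the `resolve` helper and the hosted paths are all hypothetical, and it assumes the client asks registries in the order they are configured and stops at the first hit.

```ruby
# Hypothetical model of a registry: a name plus the list of paths it hosts.
Registry = Struct.new(:name, :hosted_paths) do
  # Mimics a HEAD request: 200 if the registry hosts the path, 404 otherwise.
  def head(path)
    hosted_paths.include?(path) ? 200 : 404
  end
end

# Asks each registry in turn and stops at the first one that has the path.
# Returns the [registry name, status] pairs seen along the way.
def resolve(path, registries)
  responses = []
  registries.each do |registry|
    status = registry.head(path)
    responses << [registry.name, status]
    break if status == 200
  end
  responses
end

# Illustrative contents only: one internal package on GitLab, one public package elsewhere.
gitlab  = Registry.new("gitlab",  ["com/example/internal-lib"])
central = Registry.new("central", ["org/apache/commons/commons-lang3"])

# GitLab is asked first and answers 404 before the other registry answers 200.
p resolve("org/apache/commons/commons-lang3", [gitlab, central])
```

Under these assumptions, every dependency that lives elsewhere still costs GitLab one request that ends in a `404`.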
💡 Proposal (Friday's Crazy Idea #8649134)
Given that those `404`s account for more than 75% of the requests, what if we took a moment to run a preliminary check:
- Given a maven path (the parameters we receive on the file endpoints), does a `Packages::Maven::Metadatum` record exist?
  - If no, we can reply `404` right away.
  - If yes, execute the actual endpoint logic as it is today.
That's a simple existence check. There are no authentication or authorization checks. There are no queries to the packages table. It's just a simple answer to a simple question: does the path `my/path/bananas` exist in the table?
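A rough sketch of that flow, in plain Ruby so it runs on its own. Everything here is hypothetical: `KNOWN_MAVEN_PATHS` stands in for the `packages_maven_metadata` table, and `maven_path_exists?`, `handle_file_request` and `existing_endpoint_logic` are made-up names, not the real endpoint code. In the actual application the check would presumably be an ActiveRecord existence query on `Packages::Maven::Metadatum`.

```ruby
require "set"

# Hypothetical stand-in for the packages_maven_metadata table:
# only the set of maven paths that have a metadatum record.
KNOWN_MAVEN_PATHS = Set["my/path/bananas"].freeze

# The preliminary check: no authentication, no authorization,
# no packages lookup. Only "does this path exist?".
def maven_path_exists?(path)
  KNOWN_MAVEN_PATHS.include?(path)
end

# Placeholder for the endpoint logic as it is today (auth, package lookup, file lookup).
def existing_endpoint_logic(path, file_name)
  { status: 200, file: "#{path}/#{file_name}" }
end

# Short-circuits with a 404 when the path is unknown,
# otherwise falls through to the existing logic unchanged.
def handle_file_request(path, file_name)
  return { status: 404 } unless maven_path_exists?(path)

  existing_endpoint_logic(path, file_name)
end

p handle_file_request("my/path/bananas", "bananas.jar")
p handle_file_request("some/other/path", "other.jar")
```

The point of the design is that the cheap check runs before anything else, so requests for packages hosted elsewhere never reach the more expensive part of the endpoint.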
Why can we take the luxury of checking the `packages_maven_metadata` table first? One word: size. The `packages_maven_metadata` table is not a busy one such as `projects` or `packages`. We thus have way fewer records, so scanning that table with a maven path is super fast.
Actually, I'm thinking that this could explain why the instance-level endpoints are faster than the project-level endpoints.