Improve Maven response times for 404s

Endpoint HEAD /api/:version/groups/:id/-/packages/maven/*path/:file_name

🍩 Context

The maven API file endpoints are hammered with requests.

Upon inspecting the response status for the last 24 hours, we get this:

[Screenshot: response statuses for the last 24 hours (Screenshot_2021-04-02_at_17.41.29)]

So it's safe to say that more than 75% of the traffic on those file endpoints doesn't return an actual file.

Why is it so?

This comes from how mvn (the main client) works. When you set up multiple registries for your dependencies and pull them, mvn will loop over each registry and ask: knock, knock, do you have package X?

In other words, the GitLab Maven package registry receives many requests for packages/files that are not hosted in GitLab but elsewhere. My guess is that the majority of them are in Maven Central but, due to how mvn works, GitLab gets "pinged" too.
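To make the request pattern concrete, here is a tiny simulation of the lookup loop described above. It is only a sketch: the registry names, the hosted paths and the `resolve` helper are all made up for illustration, and it assumes the client asks registries in order and stops at the first hit.

```ruby
# Hypothetical registries, in the order the client would ask them.
# Each one maps to the package paths it actually hosts.
REGISTRIES = {
  'gitlab'  => ['com/example/internal-lib'],
  'central' => ['org/apache/commons/commons-lang3', 'junit/junit']
}.freeze

# Ask each registry in turn and record the status it would return.
# Assumption: the client stops as soon as one registry has the package.
def resolve(path, registries = REGISTRIES)
  responses = {}
  registries.each do |name, hosted_paths|
    found = hosted_paths.include?(path)
    responses[name] = found ? 200 : 404
    break if found
  end
  responses
end

# A package that lives in the second registry still costs the first one a 404.
p resolve('junit/junit')
```

With GitLab listed first, every dependency that lives elsewhere produces one 404 on the GitLab side before the client moves on.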

💡 Proposal (Friday's Crazy Idea #8649134)

Given that those 404s account for more than 75% of the requests, what if we took a few moments to run a preliminary check:

  • given a maven path (the parameters we receive on the file endpoints), does a Packages::Maven::Metadatum record exist?
    • If no, we can reply with a 404 right away.
    • If yes, execute the actual endpoint logic as it is today.

That's a simple existence check. There are no authentication or authorization checks, and no queries to the packages table. It's just a simple answer to a simple question: does the path my/path/bananas exist in the table?
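The idea can be sketched in plain Ruby as below. This is not the actual endpoint code: `KNOWN_MAVEN_PATHS`, `existing_endpoint_logic` and `handle_file_request` are hypothetical names, and the `Set` is an in-memory stand-in for the packages_maven_metadata table. In the real application the check would presumably be an existence query on Packages::Maven::Metadatum.

```ruby
require 'set'

# Hypothetical stand-in for the packages_maven_metadata table:
# only the maven paths that have a metadatum record.
KNOWN_MAVEN_PATHS = Set.new(['my/path/bananas']).freeze

# Placeholder for today's endpoint logic (authentication, authorization,
# package lookup, file delivery). Here it just pretends to serve a file.
def existing_endpoint_logic(path, file_name)
  [200, "contents of #{path}/#{file_name}"]
end

# Preliminary check: answer 404 immediately when the path is unknown,
# otherwise fall through to the existing logic unchanged.
def handle_file_request(path, file_name, known_paths: KNOWN_MAVEN_PATHS)
  return [404, nil] unless known_paths.include?(path)

  existing_endpoint_logic(path, file_name)
end

status, _body = handle_file_request('my/path/apples', 'apples-1.0.pom')
puts status
```

The point of the early return is that the unknown-path case never reaches the more expensive steps at all.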

Why can we afford the luxury of checking the packages_maven_metadata table first? One word: size. The packages_maven_metadata table is not a busy one like projects or packages. It has far fewer records, so scanning that table for a maven path is super fast.

Actually, I'm thinking that this could explain why the instance level endpoints are faster than the project level endpoints.

Edited by Tim Rizzi