Skip to content

Check maven path for maven file API endpoints [RUN ALL RSPEC] [RUN AS-IF-FOSS]

David Fernandez requested to merge 326819-check-maven-path into master

🍰 Context

The maven package registry exposes a few API endpoints so that maven clients (mainly $ mvn) can interact with it.

In addition, those endpoints are "duplicated" at different levels so that users can use the most appropriate one.

The endpoint we're going to focus on this MR is the one used to "pull" or download a given package file. Basically, the request goes like this: Hey package registry, given file X from package P. The backend will check across the available packages to locate the proper one and locate the right package file. (This part is not as trivial as it seems)

On the other hand, $ mvn has a given behavior for pulling dependencies. Let's say that we have a maven project with 5 dependencies and 3 package registries set up (maven central + gitlab.com + a third party registry). The $ mvn will not check the registries in the order they are declared. Instead, it will check them all at the same time. I'm not sure why is it so. 🤷 So for our example, 3 * 5 = 15 requests will be triggered.

So in other words, on the maven file API endpoint, gitlab.com is completely hammered with requests hey do you have this file for this package?.

This leads to this peculiar situation where the majority of the requests on those file endpoints don't have a 200 response but 404 response:

statuses

=> 60-75% of the requests on those endpoints are for package files that doesn't exist at all on gitlab.com

The main issue here is that the backend will still go through all the queries (through the groups/projects and packages) to end up with an empty result set and reply 404.

In #326819 (closed), we noticed a way to improve this situation:

  • One the arguments of that endpoint is the maven path. This path is actually persisted in the database on the packages_maven_metadata table.
    • The idea here is to add a preliminary query that will quickly check if the requested path exists in the table. If it exists, we let the backend do the work as it does today. If it doesn't exist (the majority of the requests), we can quickly reply 404.

This improvement should help with the horrible response times we see on those endpoints.

Note that those endpoints deal with two methods: GET and HEAD requests.

🤔 What does this MR do?

  • Add a preliminary check for the maven file API endpoints that will verify the existency of the given maven path in packages_maven_metadata
    • if the path doesn't exist, a 404 is directly returned
    • if the path exist, the backend will process the request the same way it's currently doing
  • Update the related specs
    • Update specs for the file API endpoint on all levels
    • Util functions were upgraded with keyword arguments for better readability
    • Util functions now accept a path argument that will be used in the contacted url.
    • Added several contexts for the case with a non existing maven path
  • Those file API endpoints are among the busiest endpoints for the ~"group::package" team. As such, the blast radius in case of a problem with this change is quite big. This change is gated behind a feature flag to provide an additional safety net for gitlab.com

📸 Screenshots (strongly suggested)

Let's have a look at the SQL queries triggered by the different levels, with / without this MR, with / without the feature flag enabled.

Level Conditions Number of SQL queries Difference
Project Before this MR
Path exists
11 Baseline
Project Before this MR
Path doesn't exists
10 Baseline
Project With this MR
FF disabled
Path exists
12 +1 query (the FF existence check)
Project With this MR
FF disabled
Path doesn't exist
11 +1 query (the FF existence check)
Project With this MR
FF enabled
Path exists
13 +2 queries (the FF existence check and the maven path existence check)
Project With this MR
FF enabled
Path doesn't exist
3 -7 queries 🚀
Group Before this MR
Path exists
11 Baseline
Group Before this MR
Path doesn't exist
10 Baseline
Group With this MR
FF disabled
Path exists
12 +1 query (the FF existence check)
Group With this MR
FF disabled
Path doesn't exist
13 +1 query (the FF existence check)
Group With this MR
FF enabled
Path exists
13 +2 queries (the FF existence check and the maven path existence check)
Group With this MR
FF enabled
Path doesn't exist
3 -7 queries 🚀
Instance Before this MR
Path exists
9 Baseline
Instance Before this MR
Path doesn't exist
8 Baseline
Instance With this MR
FF disabled
Path exists
10 +1 query (the FF existence check)
Instance With this MR
FF disabled
Path doesn't exist
9 +1 query (the FF existence check)
Instance With this MR
FF enabled
Path exists
11 +2 queries (the FF existence check and the maven path existence check)
Instance With this MR
FF enabled
Path doesn't exist
3 -5 queries 🚀

Notes

  • The MR impact is minimal (+1/+2 queries).
    • One of the added query is the feature flag existence check. This query is not triggered all the time. In addition, this query will disappear when the feature flag is removed
    • The other query is the path existence check. Given the execution time of this query, the impact is minimal
  • The most used scenario (feature flag enabled + path doesn't exist) gets a nice bonus of -5/-7 SQL queries.

In conclusion, this change properly accelerate the conditions we want to target (the requested package file doesn't exist) while minimally impacting the other conditions. The above analysis bring us more confidence in this change 🚀

📐 Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team

💾 Database review

Migration up

$ rails db:migrate
== 20210413092922 AddIndexToPackagesMavenMetadataPath: migrating ==============
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:packages_maven_metadata, :path, {:name=>"index_packages_maven_metadata_on_path", :algorithm=>:concurrently})
   -> 0.0079s
-- execute("SET statement_timeout TO 0")
   -> 0.0006s
-- add_index(:packages_maven_metadata, :path, {:name=>"index_packages_maven_metadata_on_path", :algorithm=>:concurrently})
   -> 0.0199s
-- execute("RESET ALL")
   -> 0.0007s
== 20210413092922 AddIndexToPackagesMavenMetadataPath: migrated (0.0300s) =====

Migration down

$ rails db:rollback
== 20210413092922 AddIndexToPackagesMavenMetadataPath: reverting ==============
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:packages_maven_metadata, :path, {:name=>"index_packages_maven_metadata_on_path", :algorithm=>:concurrently})
   -> 0.0021s
-- execute("SET statement_timeout TO 0")
   -> 0.0005s
-- remove_index(:packages_maven_metadata, {:name=>"index_packages_maven_metadata_on_path", :algorithm=>:concurrently, :column=>:path})
   -> 0.0054s
-- execute("RESET ALL")
   -> 0.0005s
== 20210413092922 AddIndexToPackagesMavenMetadataPath: reverted (0.0094s) =====

Queries

Edited by David Fernandez

Merge request reports