Investigate: Potential inefficiency associated with requesting artifact data from non API endpoints
A recent deep dive into the GitLab Pages feature set, specifically `artifacts_server`, uncovered something seemingly bizarre that exposes a potential inefficiency in how our infrastructure is configured as well as how this feature is utilized. Let's look at the workflow associated with a call to the following example request:
https://gitlab.com/gitlab-org/gitlab-pages/-/jobs/1533612483/artifacts/file/coverage.html
The above request can be repeated by simply browsing the GitLab UI at https://gitlab.com/gitlab-org/gitlab-pages/-/jobs/1533612483/artifacts/browse and selecting the `coverage.html` file, or by running `curl -v -L` against the URL above.

The above call's only purpose is to retrieve the user's Go test coverage report in HTML format, created by `go tool cover -html ...`. This example is a small file at 382KB.
tl;dr: a user requests a file, the request goes to Rails, Rails redirects to the Pages service, which then goes back to Rails for the same object.
- A user makes a request to https://gitlab.com/gitlab-org/gitlab-pages/-/jobs/1533612483/artifacts/file/coverage.html, which is sent to our `WEB` deployment.
- Workhorse receives this request and forwards it on to Rails.
- The `Projects::ArtifactsController` does a few checks, primarily around validating whether or not this should be serviced by the Pages feature, and if yes, sends the user an HTTP 302 with a `Location` header.
- The user then follows this redirect to https://gitlab-org.gitlab.io/-/gitlab-pages/-/jobs/1533612483/artifacts/coverage.html
- Pages responds by reaching back to Rails, handled by our `API` deployment, for the same object the user requested in step 1.
- The data is then sent to the user (potentially unsafely; more on this below).
```mermaid
sequenceDiagram
    participant C as Client
    participant WH as Workhorse
    participant R as Rails
    participant P as Pages
    Note over C,WH: gitlab.com
    C->>+WH: GET /gitlab-org/gitlab-pages/-/jobs/1533612483/artifacts/file/coverage.html HTTP/2
    WH->>+R: request is forwarded
    R->>R: Projects::ArtifactsController [WEB]
    R->>-WH: HTTP 302 - https://gitlab-org.gitlab.io/-/gitlab-pages/-/jobs/1533612483/artifacts/coverage.html
    WH->>-C: response is forwarded
    Note over C,P: *.gitlab.io
    C->>+P: GET /-/gitlab-pages/-/jobs/1533612483/artifacts/coverage.html HTTP/2
    P->>+R: get object [API]
    R->>-P: return object
    P->>-C: stream object
```
- Rails code responsible for generating an HTTP 302: https://gitlab.com/gitlab-org/gitlab/-/blob/a149b49de13147cf825ef4c45d08d3f27c63add6/app/models/ci/artifact_blob.rb#L52-57
- Pages code for streaming the data back to the end user: https://gitlab.com/gitlab-org/gitlab-pages/-/blob/64f914a804a4da8a521c5cbe7df1b8cb73f45a4f/internal/artifact/artifact.go#L123-126
## Is this Bad?
Subjective, but it is potentially inefficient. GitLab.com segregates its infrastructure into logical blocks: the `WEB` deployment manages most frontend work, the `API` deployment handles anything routed to `/api/v4`, and GitLab Pages operates on its own fleet with its own external-facing IP address. The fact that we ask for a single file across all three deployments just seems strange. A further deep dive could validate how many database calls this makes and how often each component checks the appropriate permissions. The worst part, I think, is that Pages is not serving the data; the `API` is. Pages simply copies the data from the `API` to the client. This means we move (using the example above) 382KB of data from `API` to `Pages` and then onward to the client. Keep in mind, the `API` needs to retrieve this out of object storage.
We limit which file types we do this for: https://gitlab.com/gitlab-org/gitlab/-/blob/a149b49de13147cf825ef4c45d08d3f27c63add6/app/models/ci/artifact_blob.rb#L7.
From a rough log query, in a 3-hour period during peak load times, we perform this workflow upwards of 11k times. Reference: https://log.gprd.gitlab.net/goto/2c347e099b44059887dc6f9ca39b1b2d
## Should we improve this?
I would immediately vote yes, but a further dive into the operation, plus validation that I've not made mistakes in the above description, would be beneficial.
According to the `workhorse` documentation, the fact that this is a static file would indicate `workhorse` should be the party responsible for this request. If that is true, why are we forwarding the request into Rails? We could simplify all of this by having `workhorse` do what we document it as accomplishing. If Rails is the responsible party, why redirect the user to Pages? We traverse our `WEB` deployment, and then later the same user comes back and makes essentially the same call, for the exact same object, via the `API` deployment. Could this be improved by having the GitLab web UI make the appropriate API call instead? If so, we should see a mild improvement in UI responsiveness.
Because Pages is not behind any sort of CDN, every single one of these requests will always be expensive. We also set the `cache-control` header to `no-cache` on both the returned HTTP 302 and the actual data, so even if we were behind a CDN, nothing would be cached. For any repetitive calls, we use up network bandwidth and incur the cloud costs associated with both of these calls.
Hypothetically, if we minimally cached the HTTP 302, we'd see a few milliseconds of performance improvement for future client hits, as this first request does sit behind a CDN (the `WEB` deployment running gitlab.com). More cost and performance savings would be seen if a CDN sat in front of the Pages endpoint, as this is where the bulk of the data is returned. Caching does not, however, address the problem of copying the data between the API and the Pages service. I'm not sure how to appropriately evaluate the networking costs inside our cloud provider to determine how much this impacts overall infra costs. The costs are also driven more greatly by user behavior, such as retrieving these files repetitively.
It should be noted that CI retrieves artifacts via the `API`, so Pages is not used in that situation. However, we advertise the non-API URL for retrieving artifacts, so SOME requests (determined by file type) will go through this lengthy workflow. Reference: https://docs.gitlab.com/ee/ci/pipelines/job_artifacts.html#access-the-latest-job-artifacts-by-url
## Other Notables

- I made the initial request using HTTP/2, but in Workhorse we logged HTTP/1.1 🤔
- The Workhorse documentation states: "We assume that all requests that reach Workhorse pass through an upstream proxy such as NGINX or Apache first." This is no longer the case for a large chunk of our .com infrastructure, as we've moved to get rid of NGINX 🤔
- Pages does not appear to buffer this data, which leads me to believe this COULD be error prone. Clarification on this would be beneficial. Knowing this may help us better understand the memory usage of this service if we are tossing very large files around.
## References
- Moving Pages to Kubernetes: gitlab-com/gl-infra&273 (closed), and how we landed on this investigation: gitlab-com/gl-infra/delivery#1969 (closed)
- Introduction of this feature to Pages: gitlab-pages#78 (closed)
- Discussion leading to this new feature: gitlab-foss#34102 (closed)
cc'ing a few potentially interested parties: @WarheadsSE @jarv @jaime @nick.thomas @ayufan @davis_townsend