Investigate: Potential inefficiency associated with requesting artifact data from non API endpoints
A recent deep dive into the GitLab Pages feature set, specifically `artifacts_server`, uncovered something seemingly bizarre that exposes a potential inefficiency in how our infrastructure is configured as well as how this feature is utilized. Let's look at the workflow associated with a call to the following example request:
https://gitlab.com/gitlab-org/gitlab-pages/-/jobs/1533612483/artifacts/file/coverage.html
The above request can be repeated by simply browsing the GitLab UI at https://gitlab.com/gitlab-org/gitlab-pages/-/jobs/1533612483/artifacts/browse and selecting the `coverage.html` file, or by running `curl -v -L` against the URL above.

The above call's only purpose is to retrieve the user's Go test coverage report in HTML format, created by `go tool cover -html ...`. This example is a small file at 382KB.
tl;dr: a user requests a file, the request goes to Rails, Rails redirects to the Pages service, which then goes back to Rails for the same object.
- A user makes a request to https://gitlab.com/gitlab-org/gitlab-pages/-/jobs/1533612483/artifacts/file/coverage.html, which is sent to our `WEB` deployment.
- Workhorse receives this request and forwards it on to Rails.
- The `Projects::ArtifactsController` does a few checks, primarily around validating whether or not this should be serviced by the Pages feature, and if yes, sends the user an HTTP 302 with a `Location` header.
- The user then follows this redirect to https://gitlab-org.gitlab.io/-/gitlab-pages/-/jobs/1533612483/artifacts/coverage.html
- Pages responds by reaching back to Rails, handled by our `API` deployment, for the same object the user requested in step 1.
- The data is then sent to the user (potentially unsafely; more on this below).
```mermaid
sequenceDiagram
    participant C as Client
    participant WH as Workhorse
    participant R as Rails
    participant P as Pages
    Note over C,WH: gitlab.com
    C->>+WH: GET /gitlab-org/gitlab-pages/-/jobs/1533612483/artifacts/file/coverage.html HTTP/2
    WH->>+R: request is forwarded
    R->>R: Projects::ArtifactsController [WEB]
    R->>-WH: HTTP 302 - https://gitlab-org.gitlab.io/-/gitlab-pages/-/jobs/1533612483/artifacts/coverage.html
    WH->>-C: response is forwarded
    Note over C,P: *.gitlab.io
    C->>+P: GET /-/gitlab-pages/-/jobs/1533612483/artifacts/coverage.html HTTP/2
    P->>+R: get object [API]
    R->>-P: return object
    P->>-C: stream object
```
- Rails code responsible for generating an HTTP 302: https://gitlab.com/gitlab-org/gitlab/-/blob/a149b49de13147cf825ef4c45d08d3f27c63add6/app/models/ci/artifact_blob.rb#L52-57
- Pages code for streaming the data back to the end user: https://gitlab.com/gitlab-org/gitlab-pages/-/blob/64f914a804a4da8a521c5cbe7df1b8cb73f45a4f/internal/artifact/artifact.go#L123-126
## Is this Bad?
Subjective, but it is potentially inefficient. GitLab.com segregates its infrastructure into logical blocks: the `WEB` deployment manages most frontend work, the `API` deployment handles anything routed to `/api/v4`, and GitLab Pages operates on its own fleet with its own external-facing IP address. The fact that we ask for a single file across all three deployments just seems strange. A further deep dive could validate how many database calls this makes and how often each component checks the appropriate permissions. The worst part, I think, is that Pages is not serving the data; the `API` is. Pages simply copies the data from the `API` to the client. This means we move (using the example above) 382KB of data from `API` to `Pages` and then onward to the client. Keep in mind, the `API` needs to retrieve this out of object storage.
We limit which file types we do this for: https://gitlab.com/gitlab-org/gitlab/-/blob/a149b49de13147cf825ef4c45d08d3f27c63add6/app/models/ci/artifact_blob.rb#L7.
From a rough log query, in a 3-hour period during peak load times, we perform this workflow upwards of 11k times. Reference: https://log.gprd.gitlab.net/goto/2c347e099b44059887dc6f9ca39b1b2d
## Should we improve this?
I would immediately vote yes, but a further dive into the operation, plus validation that I've not made mistakes in the above description, would be beneficial.
According to the `workhorse` documentation, the fact that this is a static file would indicate `workhorse` should be the party responsible for this request. If that is true, why are we forwarding the request into Rails? We could simplify all of this by having `workhorse` do what we document it as accomplishing. If Rails is the responsible party, why redirect the user to Pages? We traverse our `WEB` deployment, and then later the same user comes back and makes essentially the same call, for the exact same object, via the `API` deployment. Could this be improved by having the GitLab web UI make the appropriate API call instead? If so, we should see a mild improvement in UI responsiveness.
Because Pages is not behind any sort of CDN, every single one of these requests will always be expensive. We also set the `cache-control` header to `no-cache` on both the returned HTTP 302 and the actual data, so even if we were behind a CDN, nothing would be cached. For any repetitive calls, we use up network bandwidth and incur the cloud costs associated with both of these calls.
Hypothetically, if we minimally cached the HTTP 302, we'd see a few milliseconds of performance improvement for future client hits, as this first request does sit behind a CDN (the `WEB` deployment running gitlab.com). More cost and performance savings would be seen if a CDN sat in front of the Pages endpoint, as this is where the bulk of the data is returned. Caching does not, however, address the problem of copying the data between the API and the Pages service. I'm not sure how to appropriately evaluate the networking costs inside our cloud provider to determine how much this impacts overall infra costs. The costs are also driven more greatly by user behavior, such as retrieving these files repetitively.
It should be noted that CI retrieves artifacts via the `API`, so Pages is not used in that situation. However, we advertise the non-API URL for retrieving artifacts, so SOME requests (determined by file type) will go through this lengthy workflow. Reference: https://docs.gitlab.com/ee/ci/pipelines/job_artifacts.html#access-the-latest-job-artifacts-by-url
## Other Notables

- I made the initial request using HTTP/2, but in Workhorse we logged HTTP/1.1 🤔
- The Workhorse documentation states: "We assume that all requests that reach Workhorse pass through an upstream proxy such as NGINX or Apache first." This is no longer the case for a large chunk of our .com infrastructure, as we've moved to get rid of NGINX 🤔
- Pages does not appear to buffer this data, which leads me to believe this COULD be error prone. Clarification on this would be beneficial. Knowing this may help us better understand the memory usage of this service if we are tossing very large files around.
## References
- Moving Pages to Kubernetes: gitlab-com/gl-infra&273 (closed), and how we landed on this investigation: gitlab-com/gl-infra/delivery#1969 (closed)
- Introduction of this feature to Pages: gitlab-pages#78 (closed)
- Discussion leading to this new feature: gitlab-foss#34102 (closed)
cc'ing a few potentially interested parties: @WarheadsSE @jarv @jaime @nick.thomas @ayufan @davis_townsend