# S3 Object Storage Artifact Upload Speed is slower than download speed
## Issue Summary
Customer in https://gitlab.zendesk.com/agent/tickets/612186 (internal use only) is reporting the following discrepancy with their artifact uploads:
- Download Speed: Up to 300+ MB/s
- Upload Speed: Limited to 17-30 MB/s
- Environment: S3 object storage with 10GB local network and 5GB internet connection
- Testing: with `rclone` they see 88.457 MiB/s when uploading the same file
## Root Cause Analysis
After thorough investigation, we've identified that the speed discrepancy is due to fundamental architectural differences in how GitLab handles downloads versus uploads:
```mermaid
graph TD
    subgraph "Download Flow (Fast: 300+ MB/s)"
        A[CI Runner] -->|"Direct S3 Access via Pre-signed URL"| B[S3 Object Storage]
    end
    subgraph "Upload Flow (Slow: 17-30 MB/s)"
        C[CI Runner] -->|"HTTP Upload"| D[GitLab Server]
        D -->|"Workhorse Processing"| E[Workhorse]
        E -->|"Single-threaded Upload"| F[S3 Object Storage]
    end
```
### Download Path
- Uses pre-signed URLs, allowing runners to download directly from S3 (sketched below)
- Completely bypasses the GitLab server
- Results in optimal performance (300+ MB/s)
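To make the pre-signed URL mechanism concrete, here is a minimal sketch using the AWS SDK for Go v2. This is not GitLab's actual implementation; the bucket name, object key, and expiry are placeholders. The server only hands out a signed URL, and the runner pulls the bytes straight from S3:

```go
// Minimal sketch of the pre-signed URL mechanism behind the download path.
// Not GitLab/Workhorse code; bucket, key, and expiry are placeholders.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	client := s3.NewFromConfig(cfg)
	presigner := s3.NewPresignClient(client)

	// The server signs a short-lived GET URL; the runner then fetches the
	// artifact directly from S3, so no artifact bytes pass through GitLab.
	req, err := presigner.PresignGetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String("example-artifacts-bucket"),
		Key:    aws.String("project/123/artifacts.zip"),
	}, s3.WithPresignExpires(15*time.Minute))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("runner downloads directly from:", req.URL)
}
```

Because the artifact bytes never touch the GitLab server, download throughput is bounded only by the runner's own path to S3.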
### Upload Path
- Data must flow through GitLab Workhorse before reaching S3 (a simplified sketch of this pattern follows)
- This additional network hop creates a bottleneck
- Limits speeds to 17-30 MB/s regardless of network capacity
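For contrast, here is a minimal sketch of the proxied pattern, assuming a toy HTTP handler and placeholder bucket/key names rather than Workhorse's real code: the runner's upload terminates at the server, which then relays the bytes to S3 over a single stream.

```go
// Simplified sketch of the upload pattern: artifact bytes stream through an
// intermediary HTTP handler before reaching S3. Not actual Workhorse code;
// the endpoint, bucket, and key are placeholders.
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// One sequential stream to S3, mirroring the single-threaded hop in the
	// diagram above: the handler can only push bytes as fast as this single
	// connection allows, regardless of how much bandwidth is available.
	uploader := manager.NewUploader(client, func(u *manager.Uploader) {
		u.Concurrency = 1
	})

	http.HandleFunc("/upload", func(w http.ResponseWriter, r *http.Request) {
		// The runner's upload lands here first, then is relayed onward to S3.
		_, err := uploader.Upload(r.Context(), &s3.PutObjectInput{
			Bucket: aws.String("example-artifacts-bucket"),
			Key:    aws.String("project/123/artifacts.zip"),
			Body:   r.Body, // streamed, but through an extra network hop
		})
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		w.WriteHeader(http.StatusCreated)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Every byte is received once and sent once more, so the extra hop plus the single sequential stream caps throughput well below what the network itself could carry.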
## Technical Details
- This is not a runner limitation but a Workhorse architectural constraint
- Unlike cache uploads (which can now go directly to S3), artifacts must be processed by the GitLab instance
- The issue is not specifically about multi-part uploads, though that would help
- The fundamental problem is the required Workhorse intermediary step for artifact uploads
## Recommendation
**From the author:**
- Implementing direct-to-S3 uploads for artifacts (similar to how downloads work)
- Bypassing the GitLab server for artifact uploads when using object storage
- Supporting parallel upload streams to maximize bandwidth utilization (see the sketch below)
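As a rough illustration of what parallel upload streams could look like (not a committed design; the part size, concurrency, bucket, and file names below are made up), the AWS SDK's multipart uploader can push several parts concurrently, which is essentially how tools like rclone reach higher throughput:

```go
// Illustration of parallel upload streams using the AWS SDK's multipart
// uploader. Values and names are illustrative only, not an agreed design.
package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("artifacts.zip")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Split the object into parts and upload several parts at once, so a
	// single large artifact can use much more of the available bandwidth.
	uploader := manager.NewUploader(s3.NewFromConfig(cfg), func(u *manager.Uploader) {
		u.PartSize = 16 * 1024 * 1024 // 16 MiB parts
		u.Concurrency = 8             // 8 parallel streams
	})

	if _, err := uploader.Upload(ctx, &s3.PutObjectInput{
		Bucket: aws.String("example-artifacts-bucket"),
		Key:    aws.String("project/123/artifacts.zip"),
		Body:   f,
	}); err != nil {
		log.Fatal(err)
	}
}
```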
**From Engineering (current plan):**
The Workhorse architectural constraint noted above is a significant one. The mechanics and requirements of creating new artifacts and uploading them are very different from those of accessing and downloading them later, so we can't simply mirror the download process.
This is part of the overall artifacts design, and also part of the Runner design. Moving away from our current workflow would be a very significant rebuild and would require moving application logic into the Runner, which so far has none. Keeping application logic out of the Runner is an important design choice for that project as well.
We are investigating the current bottlenecks in our workflow and will improve upload speeds to the extent we can without completely rebuilding the artifact workflow. Our goal is to verify that there are no unexpected bottlenecks for S3, similar to the one we found below for other storage types. If we can demonstrate that S3 uploads are roughly on par with uploads made by similar tools under the same constraints (no multi-threading), we will close out this issue.
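A rough sketch of that comparison, assuming placeholder bucket, key, and file names: time a single-stream upload with the AWS SDK for Go v2 (concurrency pinned to 1 to honour the no-multi-threading constraint), report MiB/s, and compare the number against rclone restricted to a single transfer on the same host.

```go
// Rough sketch of the comparison described above: time one single-stream
// upload and report MiB/s, for side-by-side comparison with a tool like
// rclone limited to one transfer. Bucket, key, and file are placeholders.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("artifacts.zip")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// Concurrency of 1 keeps this a single sequential stream, matching the
	// "no multi-threading" constraint of the comparison.
	uploader := manager.NewUploader(s3.NewFromConfig(cfg), func(u *manager.Uploader) {
		u.Concurrency = 1
	})

	start := time.Now()
	if _, err := uploader.Upload(ctx, &s3.PutObjectInput{
		Bucket: aws.String("example-artifacts-bucket"),
		Key:    aws.String("bench/artifacts.zip"),
		Body:   f,
	}); err != nil {
		log.Fatal(err)
	}
	elapsed := time.Since(start)

	mib := float64(info.Size()) / (1024 * 1024)
	fmt.Printf("uploaded %.1f MiB in %s (%.1f MiB/s)\n", mib, elapsed, mib/elapsed.Seconds())
}
```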
Beyond that, the level of effort involved in redesigning artifact uploads is high enough that we should instead review customer use cases with large artifacts and start with a product-first approach, to see whether there are better ways to support their workflows. It's not obvious that this particular feature would be the best solution for them.