dependency proxy: failed to upload file - for images with large layers/blobs
Summary
We seem to be running into an issue that is a special or edge case of #448886 (closed). We are running GitLab 17.2, so we already have the fix that was merged with !149984 (merged) by @10io. Still, we see blobs not reaching the dependency proxy object storage under certain conditions.
We are seeing this error message in our logs:
{
  "code": 499,
  "correlation_id": "01J46NFB8SDK7BENKZYBDV235Z",
  "error": "dependency proxy: failed to upload file",
  "level": "error",
  "method": "POST",
  "msg": "",
  "time": "2024-08-01T09:55:23Z",
  "uri": ""
}
This seems to happen only with very large images that have large layers/blobs. We can reliably reproduce the error on our instance with the codeclimate-structure image codeclimate/codeclimate-structure:latest, which is used for GitLab's code quality features. The image has a size of 2.3GB and a single layer/blob of approx. 1.1GB. This layer consistently fails to upload into the object storage that backs the dependency proxy.
From the request logs with the same correlation_id, we see that we seem to be hitting the newly implemented `uploadRequestGracePeriod` of 60s.
Some more context: we are running GitLab on OVH. Although OVH is a public cloud provider, its S3 performance and network bandwidth may not match those of the hyperscalers.
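To put the numbers above into perspective, here is a rough back-of-the-envelope calculation in Python. It assumes the worst case, in which the whole 1.1GB blob has to reach object storage within the 60s window; if the window only starts once the client has finished pulling, the real requirement is lower. The function names are ours and purely illustrative.

```python
# Rough estimate: sustained upload rate needed to push one blob to object
# storage inside the grace period. Sizes are the approximate figures from
# this report, in decimal units (1 GB = 1000 MB).

def required_throughput_mb_per_s(blob_size_gb: float, grace_period_s: float) -> float:
    """Minimum sustained upload rate in MB/s."""
    return blob_size_gb * 1000 / grace_period_s


def required_throughput_mbit_per_s(blob_size_gb: float, grace_period_s: float) -> float:
    """Same rate expressed in Mbit/s (8 bits per byte)."""
    return required_throughput_mb_per_s(blob_size_gb, grace_period_s) * 8


if __name__ == "__main__":
    mb_s = required_throughput_mb_per_s(1.1, 60)
    mbit_s = required_throughput_mbit_per_s(1.1, 60)
    print(f"{mb_s:.1f} MB/s ({mbit_s:.0f} Mbit/s)")
```

Under that assumption the storage backend would need to sustain roughly 18 MB/s for a single stream, which is plausible on a hyperscaler but not guaranteed everywhere.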
Steps to reproduce
DISCLAIMER: This might not be reproducible on a hyperscaler's S3 storage, but several self-hosted instances with different kinds of S3 storage might hit the same issue.
- Clear the dependency proxy cache and trigger the Sidekiq Cron Job to clean up the objects.
- Configure the code quality job to scan your repo.
- Change the images for the code quality job to be pulled through the dependency proxy.
- Job works on the first run (as the dependency proxy tees the downloaded image to the client).
- Job fails on the second run with an `unknown blob` error message, because the layer did not finish uploading to the backing S3 storage.
Example Project
Not a project, but a .gitlab-ci.yml snippet that should help reproduce the issue:
include:
  - template: Jobs/Code-Quality.gitlab-ci.yml
code_quality:
  image: ${CI_DEPENDENCY_PROXY_DIRECT_GROUP_IMAGE_PREFIX}/docker:20.10.12
  variables:
    CODECLIMATE_PREFIX: $CI_DEPENDENCY_PROXY_DIRECT_GROUP_IMAGE_PREFIX/
    CODECLIMATE_REGISTRY_USERNAME: $CI_DEPENDENCY_PROXY_USER
    CODECLIMATE_REGISTRY_PASSWORD: $CI_DEPENDENCY_PROXY_PASSWORD
  dependencies: []
  needs: []
  services:
    - name: ${CI_DEPENDENCY_PROXY_DIRECT_GROUP_IMAGE_PREFIX}/docker:20-dind
      alias: docker
      command: [ "--tls=false" ]
  tags:
    - small
  before_script:
    - echo $CI_DEPENDENCY_PROXY_PASSWORD | docker login $CI_DEPENDENCY_PROXY_SERVER -u $CI_DEPENDENCY_PROXY_USER --password-stdin
  rules:
    - if: $CODE_QUALITY_DISABLED == 'true' || $CODE_QUALITY_DISABLED == '1'
      when: never
    - if: $CI_PIPELINE_SOURCE == 'merge_request_event'
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
What is the current bug behavior?
The dependency proxy does not work correctly for OCI images with a single large layer, e.g. codeclimate/codeclimate-structure:latest.
The upload to the backing S3 storage is killed after 60s.
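The behavior we believe we are seeing can be illustrated with a small Python model. This is not Workhorse code (Workhorse is written in Go) and we have not verified how the grace period is enforced internally; it is only a sketch of a chunked copy that is abandoned once a deadline passes, leaving a partial object behind. The function `copy_with_deadline` and the injected `clock` are hypothetical names used for illustration.

```python
import io
import itertools


def copy_with_deadline(src, dst, deadline_s, clock, chunk_size=4):
    """Copy src to dst chunk by chunk, giving up once the deadline has passed.

    Returns (bytes_copied, completed). `clock` is any zero-argument callable
    returning seconds, injected so the behavior can be reproduced without
    waiting in real time.
    """
    start = clock()
    copied = 0
    while True:
        if clock() - start > deadline_s:
            return copied, False  # deadline hit: upload abandoned mid-blob
        chunk = src.read(chunk_size)
        if not chunk:
            return copied, True  # source exhausted: upload finished
        dst.write(chunk)
        copied += len(chunk)


if __name__ == "__main__":
    blob = b"x" * 12  # stand-in for the large layer
    fake_clock = itertools.count().__next__  # advances 1 "second" per call
    print(copy_with_deadline(io.BytesIO(blob), io.BytesIO(), 2, fake_clock))
```

With a slow enough backend the copy stops before the last chunk, which would match a blob that is served to the client on the first pull but is missing from the cache on the second.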
What is the expected correct behavior?
The upload should continue until it has finished. At the very least, the timeout should be configurable.
Results of GitLab environment info
System information
System: Ubuntu 22.04
Proxy: no
Current User: git
Using RVM: no
Ruby Version: 3.1.5p253
Gem Version: 3.5.11
Bundler Version:2.5.11
Rake Version: 13.0.6
Redis Version: 7.0.15
Sidekiq Version:7.1.6
Go Version: go1.22.5 linux/amd64
GitLab information
Version: 17.2.1-ee
Revision: 88793996279
Directory: /opt/gitlab/embedded/service/gitlab-rails
DB Adapter: PostgreSQL
DB Version: 14.11
Possible fixes
One solution I can think of would be to make the `uploadRequestGracePeriod` configurable.
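As a language-agnostic sketch of what "configurable" could look like, the following Python snippet reads the grace period from a setting and falls back to the current 60s when the value is missing or invalid. The setting name `UPLOAD_REQUEST_GRACE_PERIOD_SECONDS` is made up for this example and does not exist in GitLab or Workhorse as far as we know; an actual fix would live in the Go code and its configuration.

```python
import math
import os

DEFAULT_GRACE_PERIOD_S = 60.0  # the 60s value observed in this report

# Hypothetical setting name, used only for this illustration.
SETTING_NAME = "UPLOAD_REQUEST_GRACE_PERIOD_SECONDS"


def upload_grace_period(env=None) -> float:
    """Return the grace period in seconds, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get(SETTING_NAME)
    if raw is None or raw.strip() == "":
        return DEFAULT_GRACE_PERIOD_S
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_GRACE_PERIOD_S  # unparsable input: keep current behavior
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_GRACE_PERIOD_S  # reject zero, negative, NaN, infinity
    return value


if __name__ == "__main__":
    print(upload_grace_period({SETTING_NAME: "300"}))
```

Keeping 60s as the default would leave existing installations unchanged while letting instances with slower object storage raise the limit.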