Persistent Runner Cache Upload Errors in EKS Cluster with GitLab Operator (v17.5.3)

Summary:
Despite the fix implemented in v17.5.3, customers are still encountering errors when uploading the pipeline cache to AWS S3 buckets. The issue arises in environments where runners are configured within an EKS cluster using the GitLab Operator.

Problem Statement:

  • Cache uploads fail intermittently with a 400 Bad Request error, even though the configuration appears correct.
  • Failed uploads disrupt pipeline execution and slow builds, because subsequent jobs cannot reuse the cache.
  • The issue persists despite upgrading to v17.5.3, indicating that the fix might be incomplete or ineffective in specific environments.

Error Details:
Example of the error:

yarn/cache: found 947 matching artifact files and directories
Uploading cache.zip to https://eks-runnercache-prod.s3.eu-central-1.amazonaws.com/eks-runnercache-prod/cache/project/XXXXX/d2c-backend-dep-master
FATAL: received: 400 Bad Request
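
One detail in the log above may be worth checking: the bucket name appears both in the hostname and as the first path segment of the upload URL. Whether this is related to the 400 response is not established here; the first path segment could equally be part of the object key. The following is a minimal Python sketch for inspecting such a URL. The function name describe_s3_url is hypothetical, and the bucket name is inferred from the log line, not taken from the runner configuration.

```python
from urllib.parse import urlsplit


def describe_s3_url(url: str, bucket: str) -> dict:
    """Report where the bucket name appears in an S3 upload URL.

    Hypothetical diagnostic helper; it only inspects the string and
    makes no request to S3.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    segments = [s for s in parts.path.split("/") if s]

    # Virtual-hosted style: <bucket>.s3.<region>.amazonaws.com
    bucket_in_host = host.startswith(bucket + ".")
    # Path style (or a key prefix equal to the bucket name): /<bucket>/...
    bucket_in_path = bool(segments) and segments[0] == bucket

    remaining = segments[1:] if bucket_in_path else segments
    return {
        "host": host,
        "bucket_in_host": bucket_in_host,
        "bucket_in_path": bucket_in_path,
        "remaining_path": "/".join(remaining),
    }


# URL copied from the error log; bucket name assumed from its hostname.
url = (
    "https://eks-runnercache-prod.s3.eu-central-1.amazonaws.com/"
    "eks-runnercache-prod/cache/project/XXXXX/d2c-backend-dep-master"
)
info = describe_s3_url(url, "eks-runnercache-prod")
print(info)
```

If both flags are true for the URLs of failing uploads but not for successful ones, that difference would be a concrete lead for the investigation.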

Environment Details:

  • Runners deployed within an EKS cluster.
  • GitLab Operator is used to manage multiple runners.
  • Cache upload configured to an AWS S3 bucket with the following runner configuration:
    [runners.cache.s3]
      ServerAddress = "s3.eu-central-1.amazonaws.com"
      BucketName = "$RUNNERS_CACHE_S3_BUCKET_NAME"
      AccessKey = "$RUNNERS_CACHE_S3_ACCESS_KEY"
      SecretKey = "$RUNNERS_CACHE_S3_SECRET_KEY"

Proposed Solution:

  1. Investigate whether the cache upload implementation behaves differently in EKS + GitLab Operator setups.
  2. Verify that the fix in v17.5.3 works in all scenarios, especially multi-runner configurations.
  3. Improve error handling so that a failed cache upload reports more specific diagnostic information than the bare HTTP status.
  4. Document any additional steps or configuration changes required for EKS and S3 in similar environments.
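
To illustrate item 3: S3 error responses normally carry an XML body with Code, Message and RequestId elements, which is more informative than "400 Bad Request" alone. The following Python sketch shows the kind of summary the runner could log. It is not the runner's implementation; the function name summarize_s3_error is hypothetical, and the sample body is made up for illustration, not captured from the failing environment.

```python
import xml.etree.ElementTree as ET


def summarize_s3_error(status: int, body: str) -> str:
    """Turn an S3 XML error body into a one-line diagnostic message.

    Hypothetical helper; falls back to the bare status code when the
    body is not parseable XML.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return f"{status}: response body is not valid XML"

    # Strip any XML namespace so that "Code" matches either way.
    fields = {
        child.tag.split("}")[-1]: (child.text or "").strip()
        for child in root
    }
    code = fields.get("Code", "unknown")
    message = fields.get("Message", "")
    request_id = fields.get("RequestId", "n/a")
    return f"{status} {code}: {message} (request id {request_id})"


# Illustrative body only; the real response for this issue is unknown.
sample_body = (
    "<Error>"
    "<Code>InvalidRequest</Code>"
    "<Message>example message</Message>"
    "<RequestId>EXAMPLE</RequestId>"
    "</Error>"
)
summary = summarize_s3_error(400, sample_body)
print(summary)
```

Logging the error code and request ID would let affected users tell a signing or region problem apart from a malformed request, and give AWS support something to trace.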
