HTTP 400 with no visibility when permissions on TMPDIR were modified
Summary
During incident gitlab-com/gl-infra/production#5194 (closed), we discovered that Workhorse was failing with HTTP 400 responses, but we had no information about what was wrong. There were no events in Sentry, and no logs indicated why Workhorse had a problem handling CI artifacts. Example snippet of an HTTP 400 error:
Key | Value |
---|---|
json.content_type | text/plain |
json.correlation_id | 01FB39J76J620B47TPYQ89Q84T |
json.duration_ms | 423 |
json.host | gitlab.com |
json.method | POST |
json.route | ^/api/v4/jobs/[0-9]+/artifacts\z |
json.status | 400 |
json.ttfb_ms | 421 |
json.type | api |
json.uri | /api/v4/jobs/[REDACTED]/artifacts?artifact_format=zip&artifact_type=archive&expire_in=1+hour |
json.user_agent | gitlab-runner 13.11.0 (13-11-stable; go1.13.8; linux/amd64) |
json.written_bytes | 38 |
kubernetes.container_image | dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-workhorse-ee:14-1-202107201616-ec13864a |
kubernetes.container_name | gitlab-workhorse |
Steps to reproduce
This problem was introduced when the permissions on the Pod's TMPDIR were modified.
Upon startup of Workhorse, a script is run that creates the temporary directory with specific permissions: https://gitlab.com/gitlab-org/build/CNG/-/blob/47988fc90f97b5e1e9dfac54c6a5313d8af1cba3/gitlab-workhorse/scripts/start-workhorse#L9
The change that induced the outage created this directory for us, so the permissions were no longer set on Pod startup because the directory already existed:
gitlab-com/gl-infra/k8s-workloads/gitlab-com!1037 (merged)
The failed permissions on the temporary directory are: drwxrwsrwx - root git
The working permissions on the temporary directory are: drwxrws--T - git git
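The failure mode described above can be sketched as follows. This is a minimal illustration, not the actual start-workhorse script: it assumes the startup step behaves like a "create directory with mode" call, which leaves the mode untouched when the directory already exists. The helper name `ensure_tmpdir` and the constant `WANTED_MODE` are hypothetical; the value 0o3770 is simply the octal form of the working `drwxrws--T` listing.

```python
import os
import stat
import tempfile

# Octal form of the working listing "drwxrws--T" (setgid + sticky, rwxrwx---).
WANTED_MODE = 0o3770


def ensure_tmpdir(path: str, mode: int = WANTED_MODE) -> str:
    """Create path if it is missing, then always reapply mode.

    Returns the resulting mode string, e.g. 'drwxrwx---'.
    """
    os.makedirs(path, exist_ok=True)
    # makedirs does not touch the mode of a directory that already exists,
    # so the permissions are set explicitly here.
    os.chmod(path, mode)
    return stat.filemode(os.stat(path).st_mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "tmp")
        os.mkdir(path)
        os.chmod(path, 0o2777)  # pre-created with the failing mode bits

        # "Create with mode" is a no-op for the existing directory.
        os.makedirs(path, mode=WANTED_MODE, exist_ok=True)
        print("after makedirs:", stat.filemode(os.stat(path).st_mode))

        # An explicit chmod is what actually changes the mode.
        print("after ensure_tmpdir:", ensure_tmpdir(path))
```

Note that this only covers the mode bits. In the incident the directory was also owned by root rather than git, and a `chmod` issued by a non-owner would fail with a permission error, so ownership would need to be corrected separately.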
Example Project
What is the current bug behavior?
HTTP 400 responses during CI artifact uploads, with no logging that tells us what actually leads to this failure
What is the expected correct behavior?
No errors during CI artifact uploads
Relevant logs and/or screenshots
See linked incident: gitlab-com/gl-infra/production#5194 (closed)
Possible fixes
It looks like Rails may be very particular about directory permissions for its configured temporary directory. See issue gitlab-org/charts/gitlab#1651 (closed)
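One way to get the missing visibility would be a startup check that logs the temporary directory's owner, mode, and writability before serving traffic. The sketch below is only an illustration of that idea under stated assumptions: it is not how Workhorse or Rails validate the directory, the function names `describe_tmpdir` and `check_tmpdir` and the logger name are hypothetical, and it assumes a Unix-like system.

```python
import logging
import os
import stat

log = logging.getLogger("tmpdir-check")  # hypothetical logger name


def describe_tmpdir(path: str) -> dict:
    """Collect the facts about the directory that were missing during the incident."""
    st = os.stat(path)
    return {
        "path": path,
        "mode": stat.filemode(st.st_mode),       # e.g. 'drwxrws--T'
        "mode_bits": stat.S_IMODE(st.st_mode),   # permission bits only
        "owner_uid": st.st_uid,
        "owner_gid": st.st_gid,
        "process_uid": os.geteuid(),
        # Creating files in a directory needs both write and search permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
    }


def check_tmpdir(path: str, expected_mode: int) -> bool:
    """Log an error and return False if the directory is not usable as expected."""
    info = describe_tmpdir(path)
    ok = info["writable"] and info["mode_bits"] == expected_mode
    if ok:
        log.info("tmpdir check passed: %s", info)
    else:
        log.error(
            "tmpdir check failed: %s (expected mode %s)",
            info,
            stat.filemode(stat.S_IFDIR | expected_mode),
        )
    return ok
```

Calling `check_tmpdir(os.environ.get("TMPDIR", "/tmp"), 0o3770)` at startup would have produced an error line naming the unexpected owner and mode, instead of leaving only the bare HTTP 400 in the request logs.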