Set GITLAB_TEMPFILE_IMMEDIATE_UNLINK env var to '1'
Production Change
Change Summary
We have seen issues with our Rails application where large incoming requests are written to disk in temporary files and not cleaned up; https://gitlab.com/gitlab-org/gitlab/-/issues/324817 has more details. This is particularly common on the API fleet.
gitlab-org/gitlab!57239 (merged) makes a change to the Ruby libraries in question to immediately unlink those temporary files. The files will still consume disk space while the request is being processed (or, at worst, for as long as the Puma process is running), but the OS can reclaim the space automatically because the file isn't persisted beyond the lifetime of the Puma process.
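The unlink-while-open behaviour described above can be seen in isolation with a minimal shell sketch. This is not the patch from gitlab-org/gitlab!57239, only an illustration of the underlying filesystem semantics; it assumes Linux (it reads back through /proc), and the variable names are made up for the example.

```shell
# Illustration only: shows that an unlinked file stays usable through an
# open descriptor. Assumes Linux, because it reads back via /proc.
tmpfile=$(mktemp)            # stand-in for a request-body temp file
exec 3<>"$tmpfile"           # hold an open read/write descriptor (fd 3)
rm -- "$tmpfile"             # unlink immediately: the directory entry is gone

printf 'request body\n' >&3  # writes through fd 3 still succeed

if [ -e "$tmpfile" ]; then path_state="present"; else path_state="gone"; fi

# Opening /proc/self/fd/3 gives a fresh handle on the unlinked file,
# so this read starts at offset 0 rather than at the write position.
contents=$(cat /proc/self/fd/3)

exec 3>&-                    # last descriptor closed: the kernel frees the blocks

echo "path is $path_state; data read via descriptor: $contents"
```

Until the descriptor is closed (or the owning process exits), the data still occupies disk space even though no path refers to it, which matches the "at worst, while the Puma process is running" caveat.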
This change is behind an environment variable - GITLAB_TEMPFILE_IMMEDIATE_UNLINK - as we are monkey-patching Puma, and so we can't use a feature flag. (We intend to propose the change upstream if it works for us.)
Change Details
- Services Impacted - API, git, and web
- Change Technician - @msmiley / @smcgivern
- Change Criticality - C3
- Change Type - changescheduled
- Change Reviewer - @msmiley / @smcgivern
- Due Date - 2021-04-20 19:00 UTC
- Time tracking - 120 minutes roll-out, 120 minutes rollback
- Downtime Component - None expected
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 10 mins
- Get review and approval on https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5376 (staging VMs only)
- Get review and approval on gitlab-com/gl-infra/k8s-workloads/gitlab-com!782 (merged)
- Get review and approval on https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5358
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 90 mins
- Merge and apply https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5376 (staging VMs only)
- Merge and apply gitlab-com/gl-infra/k8s-workloads/gitlab-com!782 (merged)
- Merge and apply https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5358
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 10 mins
This will involve some manual testing. The commit descriptions in gitlab-org/gitlab!57239 (merged) have some detail, but essentially we can just try sending large-ish request bodies (say, 10 MiB) and see what happens. We'll either need to monitor all API nodes or figure out how to get the right one; if we use curl then we can use its --limit-rate option to slow down the transfer.
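A throttled upload of the kind described above might look like the following sketch. The helper names `make_test_body` and `slow_upload` are hypothetical, the 1M rate is an arbitrary choice, and the environment variables are assumed to be set as for the test script; the sketch does not check whether the endpoint accepts a file of this size.

```shell
# Hypothetical helpers. GITLAB_API_TOKEN, GITLAB_API_DOMAIN and PROJECT_ID
# are assumed to be exported in the environment before slow_upload is called.

make_test_body() {
  # Create a file of N MiB of zeros (default 10) and print its path.
  size_mib=${1:-10}
  body=$(mktemp)
  dd if=/dev/zero of="$body" bs=1048576 count="$size_mib" 2>/dev/null
  printf '%s\n' "$body"
}

slow_upload() {
  # POST the file ($1) to the project's uploads endpoint. --limit-rate caps
  # the transfer at about 1 MiB/s so the request stays in flight long
  # enough to observe the temp file on the API node.
  curl --silent --show-error --limit-rate 1M \
    --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
    --form "file=@$1" \
    "https://${GITLAB_API_DOMAIN}/api/v4/projects/${PROJECT_ID}/uploads"
}
```

Usage would be along the lines of `body=$(make_test_body 10); slow_upload "$body"; rm -f "$body"`. At this rate a 10 MiB body takes roughly ten seconds to send, which leaves time to look at the node handling the request.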
For 'seeing what happens', we can compare the timing of unlink or unlinkat system calls against the current state. We can also inspect /tmp, either directly or with a tool like inotifywatch.
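As a rough sketch of what that observation could look like, the hypothetical helpers below only print the strace and inotifywatch invocations, so they can be reviewed before being run on a production node. The PID and the watch duration are placeholders supplied by the operator, and attaching strace to a process owned by another user typically needs root.

```shell
# Hypothetical helpers: they print commands rather than running them.

unlink_trace_cmd() {
  # $1: PID of a Puma process to attach to.
  # -f follows threads and child processes, -tt adds wall-clock timestamps,
  # -e trace=... restricts output to the two unlink system calls.
  printf 'strace -f -tt -e trace=unlink,unlinkat -p %s\n' "$1"
}

tmp_watch_cmd() {
  # $1: number of seconds to watch.
  # Prints a summary of create and delete events seen under /tmp.
  printf 'inotifywatch -t %s -e create -e delete /tmp\n' "$1"
}

unlink_trace_cmd 12345   # 12345 is a placeholder PID
tmp_watch_cmd 60
```

With the change applied, the expectation from the description above is that the unlink or unlinkat call for a temp file appears right after the file is created, rather than only when the request finishes.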
One way to test is to run the following script on an API node:
test_tempfile_immediate_unlink.sh
This encapsulates the manual testing we did in the staging environment. It makes an API call while tracing temp file creation and deletion. Before the change, tempfiles will have a non-zero age. After the change, tempfiles from Puma and Rack will consistently have a zero-second age, due to the new immediate-unlink behavior.
- In the GitLab web UI, create yourself a disposable short-lived Personal Access Token (PAT):
  - Create a PAT: https://gitlab.com/-/profile/personal_access_tokens
  - It needs a scope of api.
- Run the test script as shown below from any of the API hosts: test_tempfile_immediate_unlink.sh
  It creates a test file to upload and makes an API call to the uploads endpoint of the given disposable project. While running that API call, it traces file creation and deletion events, showing the age and deletion timestamp for short-lived files. Confirm that after applying this change, the temp files created by Puma and Rack have the expected 0-second age.
$ export GITLAB_API_TOKEN=<REDACTED>
$ export GITLAB_API_DOMAIN=gitlab.com
$ export PROJECT_ID=21158863
$ ./test_tempfile_immediate_unlink.sh
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 90 mins
- Revert and apply the revert of https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5376 (staging VMs only)
- Revert and apply the revert of gitlab-com/gl-infra/k8s-workloads/gitlab-com!782 (merged)
- Revert and apply the revert of https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5358
Monitoring
Key metrics to observe
- Metric: apdex and error rates
- Locations: https://dashboards.gitlab.net/d/web-main/api-overview?orgId=1 / https://dashboards.gitlab.net/d/git-main/git-overview?orgId=1 / https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1
- What changes to this metric should prompt a rollback: any negative trend correlated with this change
- Metric: disk space saturation
- Locations: https://dashboards.gitlab.net/d/api-main/api-overview?viewPanel=2661375984&orgId=1 / https://dashboards.gitlab.net/d/web-main/web-overview?viewPanel=2661375984&orgId=1 (git appears to be fully-k8s?)
- What changes to this metric should prompt a rollback: none, this should go down if anything
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.