Remove the /var/opt/gitlab/gitlab-ci/builds NFS mount on GitLab.com
Production Change
Change Summary
This change removes the /var/opt/gitlab/gitlab-ci/builds NFS mount now that cloud native build logs are 100% enabled on production, which has been the case since 2020-10-27.
Although the feature has been rolled out, we are still seeing NFS reads/writes for this mount. This is expected, because temporary files are used before uploading to object storage:
@grzesiek commented in Slack:
ArchiveTraceWorker is writing a temporary file to disk where we merge all the build log chunks
To roll out this change, we will execute the following steps:
/var/opt/gitlab/gitlab-ci/builds
We have confirmed that the last build log to the share was written on Oct 27th:
# ls -lt | head
total 109020
drwxr-xr-x 1 git git 5840896 Oct 27 14:56 2020_10
drwxr-xr-x 1 git git 6320128 Oct 27 14:00 2020_08
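The directory listing above only shows top-level modification times. A stricter check is to search the whole tree for files modified after the cutoff date. This is a minimal sketch, assuming GNU find is available on the host; the function name files_newer_than is hypothetical and not part of the change plan.

```shell
#!/usr/bin/env bash
# Print every regular file under DIR that was modified after CUTOFF.
# Assumes GNU find (for -newermt). No output means nothing was written
# after the cutoff.
#
# Example (run on a host that still has the share mounted):
#   files_newer_than /var/opt/gitlab/gitlab-ci/builds '2020-10-28'
files_newer_than() {
  local dir="$1" cutoff="$2"
  find "$dir" -type f -newermt "$cutoff" -print
}
```

An empty result for a cutoff of 2020-10-28 would be consistent with the last build log having been written on Oct 27th.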
Monitoring
- Sentry errors containing "trace" keyword -> https://sentry.gitlab.net/gitlab/gitlabcom/?query=trace
- API dashboard for build status / trace operations - PUT /api/jobs/:id / PATCH /api/jobs/:id/trace
- Build details page -> GET trace.json / GET raw
- Redis memory -> Redis Overview Dashboard
Change Details
- Services Impacted - build logs
- Change Technician - @jarv
- Change Criticality - C3
- Change Type - changeunscheduled
- Change Reviewer - @jarv
- Due Date - 2020-11-02 14:00
- Time tracking - 2 hours
- Downtime Component - none
Detailed steps for the change
- Unmount on Staging https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4488
- Unmount on Canary https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4489
Unmount on front-end main stage
- Stop chef on the front-end
knife ssh 'roles:gprd-base-fe' 'sudo service chef-client stop'
- Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4490
- Unmount and run reconfigure
/chatops run deploycmd chefclientreconfigure role_gprd_base_fe_web --no-check --production
/chatops run deploycmd chefclientreconfigure role_gprd_base_fe_api --no-check --production
- web - https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/2247115
- api - https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/2247117
- Ensure chef is started on the front-end
knife ssh 'roles:gprd-base-fe' 'sudo service chef-client start'
Unmount on Sidekiq
- Stop chef on sidekiq
knife ssh 'roles:gprd-base-be-sidekiq' 'sudo service chef-client stop'
- Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4491
- Run deploy cmd to run chef and reconfigure
/chatops run deploycmd chefclientreconfigure role_gprd_base_be_sidekiq --no-check --production
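After each reconfigure it may be worth confirming that the mount is actually gone from the hosts. This is a minimal sketch, assuming Linux hosts that expose /proc/mounts; the helper is_mounted is hypothetical and not part of the change plan.

```shell
#!/usr/bin/env bash
# Return 0 if PATH appears as a mount point in the mounts table, 1 otherwise.
# The second argument defaults to /proc/mounts and exists mainly so the
# check can be exercised against a sample file.
#
# Example (run on a front-end or sidekiq host):
#   if is_mounted /var/opt/gitlab/gitlab-ci/builds; then
#     echo "still mounted"
#   else
#     echo "not mounted"
#   fi
is_mounted() {
  local path="$1" mounts="${2:-/proc/mounts}"
  # Field 2 of each /proc/mounts line is the mount point.
  awk -v p="$path" '$2 == p { found = 1 } END { exit !found }' "$mounts"
}
```

The same test could presumably be run across a role with the knife ssh pattern used in the steps above, for example by grepping /proc/mounts for gitlab-ci/builds on each host.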
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Revert on Staging https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4488
- Revert on Canary https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4489
- Revert on Web main stage https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4490
- Revert on Sidekiq https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4491
Monitoring
Key metrics to observe
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.