Ad-hoc clean-up of broken trace chunks for blocked/deactivated users
Production Change
Change Summary
In gitlab-org/gitlab#330141 (closed) we discovered there are many CI Build Trace Chunks (chunks) in the database pointing to Redis keys that have expired. This is one of the things preventing Ci::ArchiveTracesCronWorker from functioning (it times out in SQL while doing the initial collation of which builds need to be archived). Subsequently, in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13342 we decided to declare bankruptcy on those traces/jobs and delete the traces entirely. Initially we are only going to do this for blocked/deactivated users, as these are likely to have been blocked for CI abuse. Traces for other users will be dealt with separately if the archive job still cannot run successfully after this.
This change is a production data change that will delete records from the database. Please refer to the linked issues above for a full discussion of the nature of this data and derivation of the Rails code that we will execute (particularly https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13342).
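For context on the scale, a rough console check is to count how many chunk rows still claim their data lives in Redis and to spot-check whether one of those keys actually still exists. This is only a sketch: the key layout and the use of the shared-state Redis instance are assumptions drawn from the trace chunk store code, not something this change depends on.
# Chunk rows whose data_store still claims Redis (1).
puts Ci::BuildTraceChunk.where(data_store: 1).count

# Spot-check one such chunk; the key format below is an assumption and should
# be confirmed against the store code before relying on it.
chunk = Ci::BuildTraceChunk.where(data_store: 1).first
Gitlab::Redis::SharedState.with do |redis|
  # A ttl of -2 means the key no longer exists, i.e. the chunk data has expired.
  puts redis.ttl("gitlab:ci:trace:#{chunk.build_id}:chunks:#{chunk.chunk_index}")
end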
This is a C2 change under the "Any procedural invocation such as a SQL script, a ruby script module, a rake task which is performed on a production console server, either using gitlab-rails or gitlab-rake" criterion.
Change Details
- Services Impacted - ServiceSidekiq ServicePostgres
- Change Technician - @cmiskell
- Change Criticality - C2
- Change Type - changescheduled
- Change Reviewer - @nhoppe1
- Due Date - 2021-05-20 01:30 UTC
- Time tracking - 30 minutes
- Downtime Component - None
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30 minutes
In the production rails console:
# Walk the chunk table in batches, find the owning builds that still have a
# stale live trace, and erase the trace chunks for any build whose user is
# missing, blocked, or the ghost user. Builds belonging to active users are
# skipped.
Ci::BuildTraceChunk.in_batches(of: 100) do |traces_batch|
  Ci::Build.with_stale_live_trace.id_in(traces_batch.select(:build_id)).includes(:user).each do |build|
    if build.user.nil? || build.user.blocked? || build.user.ghost?
      puts "Destroying trace chunks for build #{build.id}"
      build.trace.erase_trace_chunks!
    else
      puts "Skipping destruction for #{build.id} because the user #{build.user.id} is active"
    end
  end
end
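If a dry-run is wanted first (see the checklist below), a read-only variant of the same loop can count what would be erased without destroying anything. This is only a sketch reusing the scopes above, so it carries the same query cost:
require 'set'

would_erase = Set.new
would_skip  = Set.new

Ci::BuildTraceChunk.in_batches(of: 100) do |traces_batch|
  Ci::Build.with_stale_live_trace.id_in(traces_batch.select(:build_id)).includes(:user).each do |build|
    if build.user.nil? || build.user.blocked? || build.user.ghost?
      would_erase << build.id
    else
      would_skip << build.id
    end
  end
end

puts "Would erase trace chunks for #{would_erase.size} builds; would skip #{would_skip.size} builds with active users"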
In addition to the console output, this will be logged to /var/log/rails_sessions/DATESTAMP-USERNAME-rails.log, which may be more convenient for any post-processing or rollback.
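If the affected build IDs are needed later (for example for the rollback below), they can be pulled back out of that session log. A minimal sketch, assuming the same DATESTAMP-USERNAME placeholder path:
log_path = '/var/log/rails_sessions/DATESTAMP-USERNAME-rails.log'

# Collect the build IDs the script reported as destroyed, de-duplicated.
erased_build_ids = File.readlines(log_path)
  .map { |line| line[/Destroying trace chunks for build (\d+)/, 1] }
  .compact
  .map(&:to_i)
  .uniq

puts erased_build_ids.size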
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Up to 1 hour
- Wait for the Ci::ArchiveTracesCronWorker to run at 17 minutes past the next hour and see if it succeeds or fails. If it fails with a query timeout, check Sentry to confirm where.
  - Logs: https://log.gprd.gitlab.net/goto/6c44c0f49b351bc7bb51a5d8d3095d21
  - Sentry: the current known failure/timeout is at line 17 of archive_traces_cron_worker.rb, visible at https://sentry.gitlab.net/gitlab/gitlabcom/?query=is%3Aunresolved+ArchiveTracesCronWorker (not sure why we have so many entries there, but presumably the precise details of the stacktrace have varied over time).
- If the timeout is the same, we haven't cleaned up enough and need to reconsider the traces for active users and/or look further at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13348. The status of this should be reported on gitlab-org/gitlab#330141 (closed) and https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13342.
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Extract the deleted rows (by build_id) from the delayed replica Postgres database, which is 8 hours behind, or from a ZFS snapshot provisioned on patroni-v12-zfs-01-db-gprd.c.gitlab-production.internal.
In the absolute worst case we could reconstruct the available data from object storage and the build ID. Other than its primary key, a trace chunk database row contains:
- build_id: Logged by the script
- chunk_index: 0..N, one per chunk
- data_store: 3 => object storage, 1 => redis
- raw_data: null if the data is in object storage or redis (always in our case)
- checksum: CRC32 checksum of the file contents if in object storage
- lock_version: 1 if there is a checksum, 0 otherwise
- build_id_convert_to_bigint: the build_id again, but as a bigint. Mostly still just 0; it is only set to the real value by a trigger on insert, so for legacy rows we can probably just set it to 0 and clean it up later.
So we could iterate over the logged build IDs, check which chunks exist in object storage (by index), and reconstruct the rows from there (including the checksum). The data is in gs://gitlab-gprd-artifacts/tmp/builds/BUILD_ID/chunks/CHUNK_INDEX.log. This can be done at any time, with no deadline (we're not planning on deleting the object storage data).
The sequence of integer indexes must be complete from 0..N with no gaps for the Chunked IO code to work, so we'd also need to backfill any missing indexes with data_store 1 (redis), as in the sketch below.
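A rough sketch of that reconstruction, assuming we have already listed the surviving objects (e.g. with gsutil) and turned the listing into a build_id => [chunk_index, ...] map. Column values follow the list above; recomputing the CRC32 checksum from the object contents is left out, and model validations or callbacks may require adjustments, so treat this as a starting point rather than a finished script:
# Hypothetical input: chunk indexes that still exist in object storage, per build.
chunks_in_object_storage = { 12345 => [0, 1, 3] }

chunks_in_object_storage.each do |build_id, indexes|
  0.upto(indexes.max) do |idx|
    in_object_storage = indexes.include?(idx)

    Ci::BuildTraceChunk.create!(
      build_id: build_id,
      chunk_index: idx,
      data_store: in_object_storage ? 3 : 1, # 3 => object storage; gaps backfilled as 1 (redis) to keep 0..N contiguous
      raw_data: nil,
      checksum: nil,    # recompute the CRC32 of the object contents where data_store == 3
      lock_version: 0   # 0 because no checksum is being set here
    )
  end
end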
Monitoring
Key metrics to observe
Nothing traditional is available for this. At best, running select count(*) from ci_build_trace_chunks might be interesting, but it is of limited value; a console equivalent is sketched below.
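The same check from the rails console, taken before and after the change window, gives a rough idea of how many chunk rows the clean-up removed:
# Total chunk rows; compare a before/after reading around the change.
puts Ci::BuildTraceChunk.count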
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None of the above.
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.