Ad-hoc clean-up of broken trace chunks for blocked/deactivated users
Production Change
Change Summary
In gitlab-org/gitlab#330141 (closed) we discovered there are many CI Build Trace Chunks (chunks) in the database pointing to Redis keys that have expired. This is one of the things preventing Ci::ArchiveTracesCronWorker from functioning (it times out in SQL while doing the initial collation of which builds need to be archived). Subsequently, in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13342 we decided to declare bankruptcy on those traces/jobs and delete the traces entirely. Initially we are only going to do this for blocked/deactivated users, as these are likely to have been blocked for CI abuse. Traces for other users will be dealt with separately if the archive job still cannot run successfully after this.
This change is a production data change that will delete records from the database. Please refer to the linked issues above for a full discussion of the nature of this data and derivation of the Rails code that we will execute (particularly https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13342).
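For context on the scale, a rough console check is to count how many chunk rows still claim their data lives in Redis and to spot-check whether one of those keys actually still exists. This is only a sketch: the key layout and the use of the shared-state Redis instance are assumptions drawn from the trace chunk store code, not something this change depends on.
# Chunk rows whose data_store still claims Redis (1).
puts Ci::BuildTraceChunk.where(data_store: 1).count

# Spot-check one such chunk; the key format below is an assumption and should
# be confirmed against the store code before relying on it.
chunk = Ci::BuildTraceChunk.where(data_store: 1).first
Gitlab::Redis::SharedState.with do |redis|
  # A ttl of -2 means the key no longer exists, i.e. the chunk data has expired.
  puts redis.ttl("gitlab:ci:trace:#{chunk.build_id}:chunks:#{chunk.chunk_index}")
end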
This is a C2 change under the "Any procedural invocation such as a SQL script, a ruby script module, a rake task which is performed on a production console server, either using gitlab-rails or gitlab-rake" criterion.
Change Details
- Services Impacted - ServiceSidekiq ServicePostgres
- Change Technician - @cmiskell
- Change Criticality - C2
- Change Type - changescheduled
- Change Reviewer - @nhoppe1
- Due Date - 2021-05-20 01:30 UTC
- Time tracking - 30 minutes
- Downtime Component - None
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30 minutes
In the production rails console:
# Walk the chunk table in batches, find the owning builds that still have a
# stale live trace, and erase the trace chunks for any build whose user is
# missing, blocked, or the ghost user. Builds belonging to active users are
# skipped.
Ci::BuildTraceChunk.in_batches(of: 100) do |traces_batch|
  Ci::Build.with_stale_live_trace.id_in(traces_batch.select(:build_id)).includes(:user).each do |build|
    if build.user.nil? || build.user.blocked? || build.user.ghost?
      puts "Destroying trace chunks for build #{build.id}"
      build.trace.erase_trace_chunks!
    else
      puts "Skipping destruction for #{build.id} because the user #{build.user.id} is active"
    end
  end
end
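If a dry-run is wanted first (see the checklist below), a read-only variant of the same loop can count what would be erased without destroying anything. This is only a sketch reusing the scopes above, so it carries the same query cost:
require 'set'

would_erase = Set.new
would_skip  = Set.new

Ci::BuildTraceChunk.in_batches(of: 100) do |traces_batch|
  Ci::Build.with_stale_live_trace.id_in(traces_batch.select(:build_id)).includes(:user).each do |build|
    if build.user.nil? || build.user.blocked? || build.user.ghost?
      would_erase << build.id
    else
      would_skip << build.id
    end
  end
end

puts "Would erase trace chunks for #{would_erase.size} builds; would skip #{would_skip.size} builds with active users"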
In addition to the console output, this will be logged to /var/log/rails_sessions/DATESTAMP-USERNAME-rails.log, which may be more convenient for any post-processing or rollback.
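If the affected build IDs are needed later (for example for the rollback below), they can be pulled back out of that session log. A minimal sketch, assuming the same DATESTAMP-USERNAME placeholder path:
log_path = '/var/log/rails_sessions/DATESTAMP-USERNAME-rails.log'

# Collect the build IDs the script reported as destroyed, de-duplicated.
erased_build_ids = File.readlines(log_path)
  .map { |line| line[/Destroying trace chunks for build (\d+)/, 1] }
  .compact
  .map(&:to_i)
  .uniq

puts erased_build_ids.size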
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Up to 1 hour
- Wait for the Ci::ArchiveTracesCronWorker to run at 17 minutes past the next hour and see if it succeeds or fails. If it fails with a query timeout, check Sentry to confirm where.
  - Logs: https://log.gprd.gitlab.net/goto/6c44c0f49b351bc7bb51a5d8d3095d21
  - Sentry: the current known failure/timeout is at line 17 of archive_traces_cron_worker.rb, visible at https://sentry.gitlab.net/gitlab/gitlabcom/?query=is%3Aunresolved+ArchiveTracesCronWorker (not sure why we have so many entries there, but presumably the precise details of the stacktrace have varied over time).
- If the timeout is the same, we haven't cleaned up enough and need to reconsider the traces for active users and/or look further at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13348. The status of this should be reported on gitlab-org/gitlab#330141 (closed) and https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13342.
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Extract the deleted rows (by build_id) from the delayed replica Postgres database, which is 8 hours behind, or from a ZFS snapshot provisioned on patroni-v12-zfs-01-db-gprd.c.gitlab-production.internal.
In the absolute worst case we could reconstruct the available data from object storage and the build ID. Other than its primary key, a trace chunk database row contains:
- build_id: Logged by the script
- chunk_index: 0..N, one per chunk
- data_store: 3 => object storage, 1 => redis
- raw_data: null if the data is in object storage or redis (always in our case)
- checksum: CRC32 checksum of the file contents if in object storage
- lock_version: 1 if there is a checksum, 0 otherwise
- build_id_convert_to_bigint: the build_id again, but as a bigint. Mostly still just 0; it is only set to the real value by a trigger on insert, so for legacy rows we can probably just set it to 0 and clean it up later.
So we could iterate over the logged build IDs, check which chunks exist in object storage (by index), and reconstruct the rows from there (including the checksum). The data is in gs://gitlab-gprd-artifacts/tmp/builds/BUILD_ID/chunks/CHUNK_INDEX.log. This can be done at any time, with no deadline (we're not planning on deleting the object storage data).
The sequence of integer indexes must be complete from 0..N with no gaps for the Chunked IO code to work, so we'd also need to backfill any missing indexes with data_store 1 (redis), as in the sketch below.
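A rough sketch of that reconstruction, assuming we have already listed the surviving objects (e.g. with gsutil) and turned the listing into a build_id => [chunk_index, ...] map. Column values follow the list above; recomputing the CRC32 checksum from the object contents is left out, and model validations or callbacks may require adjustments, so treat this as a starting point rather than a finished script:
# Hypothetical input: chunk indexes that still exist in object storage, per build.
chunks_in_object_storage = { 12345 => [0, 1, 3] }

chunks_in_object_storage.each do |build_id, indexes|
  0.upto(indexes.max) do |idx|
    in_object_storage = indexes.include?(idx)

    Ci::BuildTraceChunk.create!(
      build_id: build_id,
      chunk_index: idx,
      data_store: in_object_storage ? 3 : 1, # 3 => object storage; gaps backfilled as 1 (redis) to keep 0..N contiguous
      raw_data: nil,
      checksum: nil,    # recompute the CRC32 of the object contents where data_store == 3
      lock_version: 0   # 0 because no checksum is being set here
    )
  end
end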
Monitoring
Key metrics to observe
Nothing traditional is available for this. At best, running select count(*) from ci_build_trace_chunks might be interesting, but it is of limited value; a console equivalent is sketched below.
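The same check from the rails console, taken before and after the change window, gives a rough idea of how many chunk rows the clean-up removed:
# Total chunk rows; compare a before/after reading around the change.
puts Ci::BuildTraceChunk.count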
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None of the above.
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.