Skip to content

Remove BuildTraceChunkFlush as an operational dependency of updating Build state

The Problem

What happens now

  • During https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19455+ we discovered that the process of moving Build trace data from "chunks" stored in Redis to the "persistable_store" is required for updating the state of the Build.
  • It is, technically (?), not a dependency because the the request cycle doesn't stop and wait for the BuildTraceChunk data to be persisted, it returns a 202 and moves on.
  • When that happens, the functional result is that the Runner sits, does nothing about the job, and tries again in a few more seconds.

Why this is a problem

  • The BuildTraceChunkFlushWorker runs on the catchall shard. This makes sense because moving data that we already have from one data store to another is not a urgent workload. There is no user-facing results; there is no customer value proposition in this work.
  • If there is a backup of jobs in the catchall queue, our Runner API endpoint will quickly and efficiently return 202s, update nothing, and the runner sits around retrying instead of continuing to process new CI work.
  • This is, to customers, a bunch of stuck jobs and stalled Pipelines. It is an outage.

(A) Proposal(s)

I'm labeling this a spike while we figure out what to do. This issue came directly out of corrective actions from the incident and Slack chat, so the path forward is not fully formed yet.

  1. Remove the transitioning of trace data between data stores from the workflow of CI processing. The business logic of the UpdateBuildStateService is tightly coupled with our trace data management. We should unwind that. This is vague, and I don't have a technical implementation proposal right now because it's a dense workflow. But the result we want is the decoupling.
  2. Ci::BuildTraceChunkFlushWorker should be high u... (#524599 - closed). If it's a dependency for an urgent process, it's an urgent process.
  3. ???

PoC to explore:

The basic idea is to remove the logic that prevents completion of the build when the chunks aren't flushed. Just update the build and schedule persistence. If the persistance fails we can use the retry mechanism to keep on trying to flush the chunks up until 25 times.

Schedule validation work from Ci::UpdateBuildStateService when the last chunk is flushed in Ci::BuildTraceChunkFlushWorker to be able to observe invalid traces.

The other piece that would need to be adjusted is the ArchiveTraceWorker. This runs when a build is finished and translates chunks to a full file. Since we will mark builds complete without the chunks fully flushed. We need ensure it will do this regardless of if the flushing is complete, perhaps by attempting to flush the chunks if they are not flushed. Or by re-secheduling the archive for later if they are not flushed.

# Ci::UpdateBuildStateService
    def execute
      if no_trace_dependency? #rename of !accept_available?
        return update_build_state!
      end

      ensure_pending_state!

      in_build_trace_lock do
        process_build_state!
      end
    end

    private

    def process_build_state_for_live_chunks!
      update_build_state!

      schedule_chunk_persistence_if_needed!
    end

    def schedule_chunk_persistence_if_needed!
      return unless live_chunks_pending?
  
      # Schedule chunks for persistence without blocking build completion
      build.trace_chunks.live.find_each do |chunk|
        chunk.schedule_to_persist!
      end
    end

Schedule validation work from Ci::UpdateBuildStateService when the last chunk is flushed in Ci::BuildTraceChunkFlushWorker to be able to observe invalid traces.

Edited by Allison Browne