Skip to content

A new `BuildTraceChunk` row is created even though a previous chunk has not been fulfilled by 128Kb of data

Problem

Today we were troubleshooting dev.gitlab.org because of https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5048. During that investigation, we found a bug in Gitlab::Ci::Trace::ChunkedIO that trace READ operation can be stuck in an infinite loop.

Basically, a live trace consists of multiple Ci::BuildTraceChunk records. Each Ci::BuildTraceChunk stores data up to 128Kb and when it's fulfilled, a new record will be created, and store the next data into that bucket.

However, it turned out that, even if a previous bucket has not been fulfilled yet, a next bucket will be created by a small chance. We've not figured out how it could happen, yet.

irb(main):046:0> Ci::Build.find(3083702).trace_chunks.order(:chunk_index)
=> #<ActiveRecord::AssociationRelation [
    #<Ci::BuildTraceChunk id: 872778, build_id: 3083702, chunk_index: 0, data_store: 3, raw_data: nil>, 
    #<Ci::BuildTraceChunk id: 873542, build_id: 3083702, chunk_index: 1, data_store: 3, raw_data: nil>, 
    #<Ci::BuildTraceChunk id: 873581, build_id: 3083702, chunk_index: 2, data_store: 3, raw_data: nil>, 
    #<Ci::BuildTraceChunk id: 873582, build_id: 3083702, chunk_index: 3, data_store: 3, raw_data: nil>, 
    #<Ci::BuildTraceChunk id: 873583, build_id: 3083702, chunk_index: 4, data_store: 1, raw_data: nil>, 
    #<Ci::BuildTraceChunk id: 873673, build_id: 3083702, chunk_index: 5, data_store: 1, raw_data: nil>]>

Ci::Build.find(3083702).trace_chunks.order(:chunk_index).map { |tc| tc.data.bytesize }
=> [131072, 131072, 131072, 131072, 44675, 50039]

With the strange state of chunk_index: 4, it clearly breaks calculate_size method, because the method always assume chunk_index: 4 has been fulfilled by 131072 size already.

        def calculate_size
          trace_chunks.order(chunk_index: :desc).first.try(&:end_offset).to_i
        end

In this situation, read, each_line, readline will never reach eof?, and stuck in infinite loop.

Incident on gitlab.com

It turned out this bug incurred data loss on gitlab.com

Affected ci_builds.id are [89929638, 89929639, 89929640, 89930241, 90085838, 90161765].

Those builds have lost some chunk in traces. It's impossible to recover (unless they retry the job) by ~corrective action.

Reproduced on dev.gitlab.org

Related

/cc @stanhu @ayufan

Edited by Shinya Maeda