A new `BuildTraceChunk` row is created even though a previous chunk has not been fulfilled by 128Kb of data
Problem
Today we were troubleshooting dev.gitlab.org because of https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5048. During that investigation, we found a bug in Gitlab::Ci::Trace::ChunkedIO
that trace READ operation can be stuck in an infinite loop.
Basically, a live trace consists of multiple Ci::BuildTraceChunk
records. Each Ci::BuildTraceChunk
stores data up to 128Kb and when it's fulfilled, a new record will be created, and store the next data into that bucket.
However, it turned out that, even if a previous bucket has not been fulfilled yet, a next bucket will be created by a small chance. We've not figured out how it could happen, yet.
irb(main):046:0> Ci::Build.find(3083702).trace_chunks.order(:chunk_index)
=> #<ActiveRecord::AssociationRelation [
#<Ci::BuildTraceChunk id: 872778, build_id: 3083702, chunk_index: 0, data_store: 3, raw_data: nil>,
#<Ci::BuildTraceChunk id: 873542, build_id: 3083702, chunk_index: 1, data_store: 3, raw_data: nil>,
#<Ci::BuildTraceChunk id: 873581, build_id: 3083702, chunk_index: 2, data_store: 3, raw_data: nil>,
#<Ci::BuildTraceChunk id: 873582, build_id: 3083702, chunk_index: 3, data_store: 3, raw_data: nil>,
#<Ci::BuildTraceChunk id: 873583, build_id: 3083702, chunk_index: 4, data_store: 1, raw_data: nil>,
#<Ci::BuildTraceChunk id: 873673, build_id: 3083702, chunk_index: 5, data_store: 1, raw_data: nil>]>
Ci::Build.find(3083702).trace_chunks.order(:chunk_index).map { |tc| tc.data.bytesize }
=> [131072, 131072, 131072, 131072, 44675, 50039]
With the strange state of chunk_index: 4
, it clearly breaks calculate_size
method, because the method always assume chunk_index: 4
has been fulfilled by 131072 size already.
def calculate_size
trace_chunks.order(chunk_index: :desc).first.try(&:end_offset).to_i
end
In this situation, read
, each_line
, readline
will never reach eof?
, and stuck in infinite loop.
Incident on gitlab.com
It turned out this bug incurred data loss on gitlab.com
Affected ci_builds.id
are [89929638, 89929639, 89929640, 89930241, 90085838, 90161765]
.
Those builds have lost some chunk in traces. It's impossible to recover (unless they retry the job) by ~corrective action.
Reproduced on dev.gitlab.org
- https://dev.gitlab.org/gitlab/gitlab-ee/-/jobs/3079570
- https://dev.gitlab.org/gitlab/gitlab-ee/-/jobs/3083702
- https://dev.gitlab.org/gitlab/gitlab-ee/-/jobs/3082920