Add gzip writer to CsvBuilder

Adam Hegyi requested to merge ah-add-gzip-writer-to-csv-builder into master

What does this MR do and why?

Related: #414937 (comment 1461641827)

This MR extends CsvBuilder with a Gzip class that writes a collection to a gzipped CSV file.

Reasoning: we plan to send a large volume of data to ClickHouse. Instead of building a large INSERT string in memory, we'll attempt to leverage its CSV format support: https://clickhouse.com/docs/en/integrations/data-formats/csv-tsv

The upload will happen via an HTTP call in which the ClickHouse (CH) server receives the compressed file and ingests the data.
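As a rough illustration of that upload step, the sketch below builds (but does not send) a POST request that streams a gzipped CSV file to ClickHouse's HTTP interface. The helper name, the base URL, and the `events` table are assumptions made for this example and are not part of this MR; authentication and error handling are left out.

```ruby
require 'net/http'
require 'uri'
require 'erb'

# Hypothetical helper (not part of this MR): prepares a streaming upload of a
# gzipped CSV file. The INSERT statement travels in the `query` URL parameter,
# the compressed rows travel in the request body.
def build_clickhouse_insert_request(base_url:, table:, io:, size:)
  uri = URI(base_url)
  sql = "INSERT INTO #{table} FORMAT CSV"
  uri.query = "query=#{ERB::Util.url_encode(sql)}"

  request = Net::HTTP::Post.new(uri)
  request['Content-Encoding'] = 'gzip' # tells the server the body is gzip-compressed
  request['Content-Type'] = 'text/csv'
  request['Content-Length'] = size.to_s # required by Net::HTTP when body_stream is used
  request.body_stream = io              # streamed from the IO, not loaded into memory
  request
end
```

Inside the `render` block, such a request could be built from `tempfile` and `tempfile.size`, then sent with `Net::HTTP.start(host, port) { |http| http.request(request) }`.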

Example usage:

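# Keyset-paginated scope; the ORDER BY columns define the cursor.
scope = Issue.order(:updated_at, :id)
iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

max_records = 10
record_count = 0

# Wrap the batches in an Enumerator so rows are produced on demand.
enumerator = Enumerator.new do |yielder|
  iterator.each_batch(of: 5) do |batch|
    batch.each do |row|
      yielder << row
      record_count += 1
      if record_count == max_records
        # maybe store the keyset cursor here
      end
    end
    # Stop after the batch that reaches the limit
    # (max_records is a multiple of the batch size here).
    break if record_count == max_records
  end
end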

CsvBuilder::Gzip.new(enumerator, { title: -> (row) { row.title.upcase }, id: :id }).render do |tempfile|
  puts tempfile.path
  puts `zcat #{tempfile.path}`
end
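To make the idea behind `render` more concrete, here is a minimal sketch of streaming rows into a gzipped CSV tempfile using only Ruby's standard `csv`, `zlib`, and `tempfile` libraries. It is not the implementation in this MR; `write_gzipped_csv` is a hypothetical name, and the column mapping only mimics the hash shown in the example above (values are either a lambda or a method name).

```ruby
require 'csv'
require 'zlib'
require 'tempfile'

# Hypothetical stand-in for the idea behind CsvBuilder::Gzip (not the MR's code).
# `rows` is any enumerable; `columns` maps a header name to a lambda or a method name.
def write_gzipped_csv(rows, columns)
  tempfile = Tempfile.new(['export', '.csv.gz'])
  tempfile.binmode

  gz = Zlib::GzipWriter.new(tempfile)
  gz.write(CSV.generate_line(columns.keys)) # header row

  rows.each do |row|
    values = columns.values.map do |attribute|
      attribute.respond_to?(:call) ? attribute.call(row) : row.public_send(attribute)
    end
    gz.write(CSV.generate_line(values)) # one compressed line at a time
  end

  gz.finish       # writes the gzip trailer but keeps the tempfile open
  tempfile.flush
  tempfile.rewind
  yield tempfile
ensure
  tempfile&.close! # close and delete the file once the block returns
end
```

Because rows are compressed as they are pulled from the enumerator, memory use does not depend on holding the whole CSV document at once.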

The iteration is controlled outside of the CSV library. At some point we might need to stop the processing and continue later (this means a new CSV file, of course).
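The stop-and-continue idea can be sketched in plain Ruby without any GitLab classes. In the example below, `limited_rows` is a hypothetical helper and a bare `id` stands in for the real keyset cursor; each call produces at most `limit` rows and records where the next run (and therefore the next CSV file) should resume.

```ruby
# Hypothetical helper: returns an Enumerator over at most `limit` rows whose id
# is greater than `after_id`, plus a state hash tracking the last id yielded.
def limited_rows(rows, after_id:, limit:)
  state = { last_id: after_id }

  enumerator = Enumerator.new do |yielder|
    rows.lazy.select { |row| row[:id] > after_id }.first(limit).each do |row|
      yielder << row
      state[:last_id] = row[:id] # cursor to persist for the next run
    end
  end

  [enumerator, state]
end

rows = (1..7).map { |i| { id: i } }

# First run: consuming the enumerator would feed the first CSV file.
first_run, first_state = limited_rows(rows, after_id: 0, limit: 3)
first_ids = first_run.map { |row| row[:id] }

# Later run: resumes after the stored cursor and feeds a new CSV file.
second_run, second_state = limited_rows(rows, after_id: first_state[:last_id], limit: 3)
second_ids = second_run.map { |row| row[:id] }
```

The state hash is only updated while the enumerator is being consumed, so the cursor should be read after the CSV file has been written.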

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Adam Hegyi
