WIP: Introduce simple ActiveRecord-based bulk-insert functionality

Matthias Käppler requested to merge 196844-bulk-insert-associations into master

NOTE: I am closing this in favor of two smaller MRs:

What does this MR do?

Adds support for bulk-inserting associations safely.

References: #196844 (closed)

New bulk insertion API

Bulk insertions are crucial for storing large amounts of data efficiently. However, we also identified the need for this to happen safely, i.e. by ensuring that bulk insertions are only available when we have certain guarantees that we are not causing integrity problems or violating business rules (often encoded in ActiveRecord validations).

This MR builds on !24168 (merged) in the following ways:

BulkInsertSafe.[bulk_insert|bulk_insert!]

These two new methods operate on sequences of ActiveRecord objects. They behave similarly to save and save! in that they run validations and then either return a boolean indicating success (bulk_insert) or raise an exception on failure (bulk_insert!). This ensures that we won't write data that would be rejected if it were instead inserted via save or similar built-ins.

Internally these calls rely on ActiveRecord 6's new InsertAll type, which inserts hashes in bulk but does not run validations by itself. Building on InsertAll and running validations before inserting are the primary differences from the existing Database.bulk_insert helper.

Note that as of !24168 (merged) you can only access this functionality if (as the name suggests) your target model type is considered "safe for bulk insertion"; these rules are currently fairly simple and prevent certain callbacks from being registered, but can be easily expanded on in the future.
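To illustrate what "preventing certain callbacks from being registered" could look like, here is a minimal plain-Ruby sketch. It does not use ActiveRecord and is not the code from !24168: the module name, the register_callback method, and the list of blocked callbacks are all hypothetical and chosen only for illustration.

```ruby
# Hypothetical sketch: a mixin that refuses to register callbacks it
# considers unsafe for bulk insertion. Not the actual BulkInsertSafe code.
module CallbackGuardSketch
  NotAllowedError = Class.new(StandardError)

  # Assumption: this list is illustrative; the real rules may differ.
  BLOCKED = %i[before_save after_save after_commit].freeze

  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Callbacks that were accepted for this class.
    def callbacks
      @callbacks ||= []
    end

    # Stand-in for a framework-level callback registration method.
    def register_callback(callback_name)
      if BLOCKED.include?(callback_name)
        raise NotAllowedError, "#{callback_name} is not allowed on a bulk-insert-safe type"
      end

      callbacks << callback_name
    end
  end
end

class SafeThing
  include CallbackGuardSketch

  register_callback :before_validation # accepted
end
```

With this sketch, calling SafeThing.register_callback(:after_save) raises CallbackGuardSketch::NotAllowedError at class-definition time, so an unsafe model fails fast instead of silently skipping its callbacks during a bulk insert.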

Code example:

class LabelLink < ApplicationRecord
  include BulkInsertSafe
end

label_links = ... # build some label links
LabelLink.bulk_insert(label_links, batch_size: 100)
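The validate-then-insert behavior described above can be sketched in plain Ruby. This is a simplified model under stated assumptions, not the MR's implementation: Row, FakeTable, BulkInsertSketch, and SketchInvalidRecord are hypothetical stand-ins, the default batch size is arbitrary, and FakeTable#insert_all merely records the batches it is given rather than talking to a database.

```ruby
# Hypothetical stand-in for an ActiveRecord model with one validation.
Row = Struct.new(:name) do
  def valid?
    !name.nil? && !name.empty?
  end

  def attributes
    { name: name }
  end
end

class SketchInvalidRecord < StandardError; end

# Stand-in for the database layer: remembers every batch of row hashes.
class FakeTable
  attr_reader :batches

  def initialize
    @batches = []
  end

  def insert_all(rows)
    @batches << rows
  end
end

module BulkInsertSketch
  # Like `save`: returns false without writing anything if any item is invalid.
  def self.bulk_insert(table, items, batch_size: 500)
    return false unless items.all?(&:valid?)

    items.each_slice(batch_size) do |slice|
      table.insert_all(slice.map(&:attributes))
    end
    true
  end

  # Like `save!`: raises instead of returning false.
  def self.bulk_insert!(table, items, batch_size: 500)
    ok = bulk_insert(table, items, batch_size: batch_size)
    raise SketchInvalidRecord, 'one or more records are invalid' unless ok

    true
  end
end
```

In this sketch all items are validated before the first batch is written, so a single invalid item prevents any insert. Whether the real API validates up front or per batch is not stated here and is an assumption of the example.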

BulkInsertableAssociations: insert has_many associations in bulk

Given a type that is BulkInsertSafe, if it appears on the "owned" end of a relationship such as has_many, we can now bulk-insert these records via the owner. This is currently done using a combination of two method calls where we first schedule a set of records for bulk insertion, then flush them whenever the parent is saved:

class MergeRequestDiff
  include BulkInsertableAssociations

  has_many :merge_request_diff_commits
end

parent = MergeRequestDiff.new
diff_commits = ...

parent.try_bulk_insert_on_save(:merge_request_diff_commits, diff_commits)
...
parent.save # this will insert all pending `diff_commits` in bulk

Internally this is realized using an after_save hook. This way we can exploit the transactionality of AR's callback chains. The try_bulk_insert_on_save helper actually lives on ApplicationRecord to make these inserts safer and a little less awkward, since callers cannot know upfront whether (a) the parent defines that method and (b) the targeted association is BulkInsertSafe.
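The "schedule now, flush on save" mechanism can be modeled without ActiveRecord as follows. This is only a sketch under assumptions: RecordSketch plays the role of ApplicationRecord, its private after_save method stands in for the real after_save hook, and bulk_insert_on_save, PendingInsertsSketch, and the false return value of the fallback are hypothetical. No transaction handling is shown.

```ruby
# Plays the role of ApplicationRecord in this sketch.
class RecordSketch
  # Hypothetical fallback: report false when the receiver cannot bulk-insert.
  def try_bulk_insert_on_save(association, items)
    return false unless respond_to?(:bulk_insert_on_save)

    bulk_insert_on_save(association, items)
    true
  end

  def save
    # A real record would be persisted here, inside a transaction.
    after_save
    true
  end

  private

  def after_save; end
end

# Plays the role of BulkInsertableAssociations in this sketch.
module PendingInsertsSketch
  # Schedule items for insertion; nothing is written yet.
  def bulk_insert_on_save(association, items)
    pending_inserts[association].concat(items)
  end

  # Log of flushes as [association, item_count] pairs, for inspection.
  def flushed
    @flushed ||= []
  end

  private

  def pending_inserts
    @pending_inserts ||= Hash.new { |hash, key| hash[key] = [] }
  end

  # Stands in for the after_save hook: one "bulk insert" per association.
  def after_save
    super
    pending_inserts.each { |association, items| flushed << [association, items.size] }
    pending_inserts.clear
  end
end

class DiffSketch < RecordSketch
  include PendingInsertsSketch
end

class PlainSketch < RecordSketch; end
```

Here DiffSketch.new.try_bulk_insert_on_save(:commits, items) only queues the items, and the following save flushes them once. PlainSketch, which does not include the mixin, simply reports false, which is one possible way for callers to fall back to regular inserts.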

Migration path

Since this new API extends the existing bulk-insert functionality in several ways, we should establish:

  • whether it can fully replace Database.bulk_insert
  • or whether it should live alongside it (considering it operates on AR instances, not row hashes)
  • or whether we should first migrate to insert_all everywhere

TODOs:

  • ensure thread-safety
  • handle validations on pending inserts
  • what happens to new items that are yet unsaved?
  • implement batching
  • documentation & better error messages with links to docs
  • implement in importer, measure results
  • insert_all vs upsert_all
  • consider using insert_all! to catch duplicate key errors
  • bulk_insert wrapper function
  • feature toggle

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

  • we plan to roll this out behind a feature flag first, where we can enable it in project imports
  • it would be interesting to test this with another existing feature, but I would need some pointers on what that could be
Edited by 🤖 GitLab Bot 🤖