
WIP: POC for bulk-insert API

Matthias Käppler requested to merge 36992-bulk-insert-api-poc into master

What does this MR do?

This is not fully functional yet.

As per #36992 (closed) we're looking for a cross-cutting way to introduce a bulk-insert API, so that INSERTs are batched rather than executed one by one. We already have raw, SQL-level bulk-insert functionality via Gitlab::Database::bulk_insert, but it operates at the row level and does not trigger ActiveRecord callbacks.

What we aim for here is to provide full bulk-insert functionality in ActiveRecord, including callbacks.

The current API is very simple:

MyEntity.save_all!(instances)

This will trigger 1 INSERT instead of instances.size inserts, and all callbacks will fire transactionally.

Example program:

# Use ImportFailure as the model under test; it is reopened below to add
# noisy save/commit callbacks.
Model = ImportFailure

Model.delete_all

class Model < ApplicationRecord
  before_save -> { p "model::before_save" }
  after_save -> {
    p "model::after_save"
    # create something transactionally
    OauthAccessToken.create!
  }
  after_commit -> {
    p "model::after_commit"
  }
end

proj = Project.last
m1 = Model.new(project: proj)
m2 = Model.new(project: proj)

# Both instances are persisted with a single INSERT; all callbacks still fire.
Model.save_all!(m1, m2)

SQL log:

(0.2ms)  BEGIN
  ↳ lib/gitlab/database/bulk_ops/bulk_insert_support.rb:35
  Project Load (1.4ms)  SELECT  "projects".* FROM "projects" ORDER BY "projects"."id" DESC LIMIT $1  [["LIMIT", 1]]
  ↳ i@ar-batching.rb:29
  ImportFailure Load (0.4ms)  SELECT  "import_failures".* FROM "import_failures" ORDER BY "import_failures"."id" DESC LIMIT $1  [["LIMIT", 2]]
  ↳ i@ar-batching.rb:39
   (0.3ms)  SELECT VERSION()
  ↳ lib/gitlab/database.rb:239
   (0.4ms)          INSERT INTO import_failures ("project_id", "created_at")
        VALUES (66, '2019-12-12 16:43:35.245218'), (66, '2019-12-12 16:43:35.245840')
 RETURNING id
  ↳ lib/gitlab/database.rb:191
  Doorkeeper::AccessToken Exists (0.6ms)  SELECT  1 AS one FROM "oauth_access_tokens" WHERE "oauth_access_tokens"."token" = $1 LIMIT $2  [["token", "23fed6b958cb9b3dd196c68a9e2d40c39b5271e5080a06c804752a3e0ba9c92d"], ["LIMIT", 1]]
  ↳ i@ar-batching.rb:21
  OauthAccessToken Create (0.3ms)  INSERT INTO "oauth_access_tokens" ("token", "created_at") VALUES ($1, $2) RETURNING "id"  [["token", "23fed6b958cb9b3dd196c68a9e2d40c39b5271e5080a06c804752a3e0ba9c92d"], ["created_at", "2019-12-12 16:43:35.260848"]]
  ↳ i@ar-batching.rb:21
  Doorkeeper::AccessToken Exists (0.3ms)  SELECT  1 AS one FROM "oauth_access_tokens" WHERE "oauth_access_tokens"."token" = $1 LIMIT $2  [["token", "7bd111d90906bb536456e114db3d8ba256bc3214ec90cd89d367ad494d9fcdf3"], ["LIMIT", 1]]
  ↳ i@ar-batching.rb:21
  OauthAccessToken Create (0.2ms)  INSERT INTO "oauth_access_tokens" ("token", "created_at") VALUES ($1, $2) RETURNING "id"  [["token", "7bd111d90906bb536456e114db3d8ba256bc3214ec90cd89d367ad494d9fcdf3"], ["created_at", "2019-12-12 16:43:35.264559"]]
  ↳ i@ar-batching.rb:21
   (1.1ms)  COMMIT
  ↳ lib/gitlab/database/bulk_ops/bulk_insert_support.rb:35

It accomplishes this by hooking into two ActiveRecord methods:

  • _create_record (instance method)
  • _insert_record (class method)

and rewriting the model's callback chain so that it fires as follows (a rough sketch follows the numbered list):

  1. save_all!(items)
  2. save callback chain
  3. remove after_* hooks for all items
  4. call before_* hooks for all items
  5. insert_all items
  6. call after_* hooks for all items
  7. restore callback chain
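
To make the shape of this flow concrete, here is a heavily simplified sketch of what a save_all! implementation could look like. This is not the code in this MR: it skips the callback-chain rewriting entirely (steps 2-4 and 6-7) and only shows the outer transaction, the single INSERT via the insert_all API mentioned further down, and injecting the returned IDs back into the instances. The module and method names are made up.

# Hypothetical, heavily simplified sketch (not the actual implementation in this MR).
module BulkSaveSketch
  def save_all!(*items)
    transaction do                            # step 1: outer transaction boundary
      now = Time.current
      rows = items.map do |item|
        # insert_all does not fill in timestamps the way create does,
        # so set created_at explicitly for this sketch
        item.attributes.except("id").merge("created_at" => now)
      end

      # step 5: one INSERT for all rows; RETURNING hands us the new primary keys
      # (row order matches input order on Postgres)
      ids = insert_all(rows, returning: %w[id]).rows.flatten

      items.zip(ids).each { |item, id| item.id = id }  # inject IDs back
      items
    end
  end
end

# Usage (hypothetical):
#   ImportFailure.extend(BulkSaveSketch)
#   ImportFailure.save_all!(m1, m2)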

However, there are a number of challenges to overcome, not all of which are solved yet:

  • Ensure transactionality of operations. AR provides ACID semantics for single saves, including anything that happens in callbacks. This is already solved in this POC by providing an outer transaction boundary in which everything executes. Anything that was previously transactional at the individual level now runs inside a SAVEPOINT. I already did some testing here to make sure that e.g. an exception raised in a callback rolls back the entire batch (see the first sketch after this list).
  • Ensure thread safety. This should be solved. I started out using only thread-local state, but then realized that because we rewrite callback chains at the class level, this is not sufficient. I ended up bluntly using a transaction-wide lock. This means that all bulk inserts for the same model class are now fully serialized, i.e. while one thread is waiting for Postgres we cannot start another bulk insert on the same model type. From my testing I found that even without the Mutex the transactions were fully serialized between calls, but I might have just witnessed the GVL in action.
  • after_save ordering. With bulk inserts, the single INSERT only happens after the after_save hooks have already fired on the model instances, so any attempt to access the persisted data from such a hook fails. I provided a hacky solution to this that involves first detaching and later restoring the callback chains.
  • Inject IDs back into instances. Because model IDs are not available until after all rows are inserted, we need to inject them after the fact. This is now solved.
  • Support after/before_create hooks. Done.
  • Support around_* hooks? I realized that these are poorly documented in Rails, and they seem to be defined differently from other hooks (since they must yield to the remaining callback chain). I also found 0 uses of around_(save|create) in our code base, so this might be low priority for now.
  • Support entity relations. I believe this "just works". I only tested it with a belongs_to relation in the batched model. AR handles these via special before filters that save the association along with the owning model instance, so unless we mess with the before filter chain (which we don't currently), anything happening there should be invoked as part of the outer transaction, which is what we want.
  • Limit batch size. Currently, there is no limit to how many new items you can pass to save_all!. We might queue up so many inserts that we either exceed the statement size allowed by Postgres or produce a statement that is very slow to execute. We should instead accept a batch size and perform INSERTs in groups of batch_size (see the batching sketch after this list).
  • Use a library to perform bulk inserts at the SQL level. I ended up backporting the new insert_all API from ActiveRecord 6 (a call-site example follows this list). This worked a lot better than our internal helper, and it will make for a smooth transition once we're on Rails 6.
  • Consider supporting UPDATEs. The POC only covers INSERTs so far. Internally, however, the paths ActiveRecord takes for inserts and updates are quite similar. We should look for opportunities to generalize this to any form of save.
  • Write tests. Added a fairly extensive test suite and will continue to improve it.
  • Ensure isolation of bulk inserts from normal operation. Since we interfere with AR hooks and per-item functionality when mixing in the BulkInsertSupport module, we need to verify that per-item operations like an ordinary save remain intact when called outside of save_all!. For instance, I'm not sure what would happen in this case: 1) thread A calls save_all! and blocks on I/O, yielding to thread B; 2) thread B calls save. At that point, the class's after_* callbacks have been rewritten to run delayed, but we don't want that for individual saves! The individual save will also not run in the same transaction, because a) bulk-insert state is thread-local and b) well, it's already being committed. Maybe for now we should fail fast and raise as soon as we detect that the Mutex is held (would this require the lock to be re-entrant?); see the guard sketch after this list.
  • Verify use in import/export. Making our imports faster was the primary goal. Apply bulk inserts to an import and make sure it works. Hopefully it'll be faster too.
  • Code cleanup & speed-up. Once everything else is working, we should look into making the implementation as clean and fast as we can.
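
To make the transactionality point concrete, the snippet below is plain Rails nested-transaction behaviour shown for reference, not code from this MR: once the outer save_all! transaction is open, a transaction opened by an individual record's callbacks becomes a SAVEPOINT, and an unrescued error rolls back the whole batch.

begin
  ApplicationRecord.transaction do                    # BEGIN, the outer boundary
    ImportFailure.transaction(requires_new: true) do  # SAVEPOINT, not a second BEGIN
      ImportFailure.create!(project: Project.last)    # work done inside a callback
    end
    raise "boom"                                      # e.g. a callback blowing up later
  end
rescue RuntimeError
  # everything since the outer BEGIN, including the record created inside the
  # SAVEPOINT above, has been rolled back
end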
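
For reference, this is what the backported insert_all call site looks like (standard Rails 6 API; the column values are made up), followed by a rough shape for the proposed batch-size limit. The batch_size parameter is a suggestion and does not exist in this MR yet.

ImportFailure.insert_all(
  [
    { project_id: 66, created_at: Time.current },
    { project_id: 66, created_at: Time.current }
  ],
  returning: %w[id]
)

# Sketch of the proposed batch-size limit (hypothetical parameter):
def save_all!(*items, batch_size: 1_000)
  items.each_slice(batch_size) do |slice|
    # one INSERT per slice instead of a single, arbitrarily large statement
  end
end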
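
One possible shape for the fail-fast idea in the isolation bullet is sketched below. The names are hypothetical (bulk_insert_lock is an assumed per-class Mutex, not something this MR defines); Mutex#locked? and Mutex#owned? are standard Ruby.

# Hypothetical guard: refuse an individual save while another thread holds the
# bulk-insert lock for this class.
def save(*args, &block)
  lock = self.class.bulk_insert_lock
  if lock.locked? && !lock.owned?   # held by a different thread
    raise "#{self.class} is in the middle of a bulk insert on another thread"
  end

  super
end

If the bulk-insert path itself needs to re-acquire the lock from the same thread, a Monitor (which is re-entrant, unlike Mutex) could be used instead.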

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

TODO
