fix: prevent overlimit errors with bulk size tracking

Relates to issue #179 (closed)

This MR was created by Duo in Session 1217497

I believe this fixes a gap in the olivere library's calculation of the indexed document size: the overhead added by the bulk operation metadata isn't included in the library's bulk processor checks, so this change accounts for it.

AI Summary

This merge request adds smart bulk size management to the GitLab Elasticsearch indexer to prevent "Request Entity Too Large" errors that commonly occur with AWS OpenSearch services.

The main changes include:

New Features:

  • Automatic size tracking: The system now monitors how much data is being prepared for each batch request to Elasticsearch
  • Proactive flushing: When adding a new document would make the batch too large, it automatically sends the current batch first, then starts a new one (sketched after this list)
  • Debug mode: Added an environment variable ELASTIC_DEBUG to help developers see what requests are being sent to Elasticsearch
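
A minimal sketch of the flush-before-add check (illustrative only; the type and field names below are assumptions, not the indexer's actual code):

// Illustrative sketch: track the bytes queued in the current batch and flush
// before adding a document that would push the batch over the limit.
package main

import (
	"fmt"
	"sync"
)

type batchTracker struct {
	mu          sync.Mutex
	currentSize int64
	maxSize     int64  // 0 means no limit configured
	flush       func() // e.g. would wrap the bulk processor's Flush()
}

func (b *batchTracker) add(docSize int64) {
	b.mu.Lock()
	defer b.mu.Unlock()

	// If this document would make the batch too large, send the current
	// batch first, then start a new one.
	if b.maxSize > 0 && b.currentSize+docSize > b.maxSize {
		b.flush()
		b.currentSize = 0
	}
	b.currentSize += docSize
}

func main() {
	b := &batchTracker{maxSize: 4096, flush: func() { fmt.Println("flush") }}
	for _, size := range []int64{1982, 1982, 873} {
		b.add(size)
	}
	fmt.Println("remaining in batch:", b.currentSize)
}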

Technical Improvements:

  • Thread safety: Added proper locking to ensure the size tracking works correctly when multiple operations happen simultaneously
  • Size calculation: The system estimates document sizes by converting them to JSON and accounting for Elasticsearch metadata overhead (see the sketch after this list)
  • Configuration: Added max_bulk_size_bytes setting with a default of 10MB, but AWS users should set it to 9MB to account for request overhead
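
A rough sketch of the size estimate (again illustrative; estimateDocSize and the bulkMetadataOverhead allowance are assumed names and values, not the indexer's actual code):

// Illustrative sketch: estimate how many bytes a document adds to the _bulk
// request body by marshalling it to JSON and adding an allowance for the
// action metadata line and the newlines that separate bulk entries.
package main

import (
	"encoding/json"
	"fmt"
)

// Assumed per-document allowance for the {"index":{...}} action line plus
// newlines; the real value depends on the index name and document ID length.
const bulkMetadataOverhead = 512

func estimateDocSize(doc any) (int64, error) {
	body, err := json.Marshal(doc)
	if err != nil {
		return 0, err
	}
	return int64(len(body)) + bulkMetadataOverhead, nil
}

func main() {
	doc := map[string]any{"path": "file_1300_bytes_1.txt", "type": "blob"}
	size, err := estimateDocSize(doc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("estimated contribution to the bulk request: %d bytes\n", size)
}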

Benefits:

  • Prevents failed requests due to size limits, especially important for AWS OpenSearch which has strict 10MB limits
  • Reduces retry overhead and improves performance
  • Works automatically without requiring configuration changes
  • Includes comprehensive tests to ensure reliability

The changes are backward compatible and work transparently with existing bulk processing settings.

How to test this

Tested in gitlab-org/gitlab project: gitlab!211127 (closed)

Testing this was SUCH A PAIN.

  1. Edit the Elasticsearch config to lower the maximum request size to 4096 bytes in GDK_DIR/elasticsearch/config/elasticsearch.yml:
    # ======================== Elasticsearch Configuration =========================
    
    http.max_content_length: 4096b
  2. Verify the setting:
    curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.http.max_content_length" | jq
  3. Create a project with a few files; these are the sizes I used:
    ~/Developer/scratch/indexer_bulk_limit_tests/my-exceeding-project on 🚀 main on ☁️
    ➜ ls -la
    total 24
    drwxr-xr-x@  6 terrichu  staff   192 Nov  3 15:45 .
    drwxr-xr-x@  8 terrichu  staff   256 Nov  3 15:22 ..
    drwxr-xr-x@ 12 terrichu  staff   384 Nov  3 15:46 .git
    -rw-r--r--@  1 terrichu  staff  1300 Nov  3 15:44 file_1300_bytes_1.txt
    -rw-r--r--@  1 terrichu  staff  1300 Nov  3 15:44 file_1300_bytes_2.txt
    -rw-r--r--@  1 terrichu  staff   200 Nov  3 15:44 file_200_bytes.txt
  4. Apply this patch. I had to disable the workers and bulk settings for the ES client to get it to actually go over the limit. I think this edge case would need many more files before it shows up in production.
diff --git a/internal/mode/advanced/elastic/client.go b/internal/mode/advanced/elastic/client.go
index 8c22a7f..ae2910c 100644
--- a/internal/mode/advanced/elastic/client.go
+++ b/internal/mode/advanced/elastic/client.go
@@ -180,7 +180,6 @@ func NewClient(config *Config, correlationID string) (*Client, error) {
 		GroupID:               config.GroupID,
 		Permissions:           config.Permissions,
 		PermissionsWiki:       config.PermissionsWiki,
-		maxBulkSize:           config.MaxBulkSize,
 		traversalIDs:          config.TraversalIDs,
 		Client:                client,
 		hashedRootNamespaceId: config.HashedRootNamespaceId,
@@ -191,8 +190,8 @@ func NewClient(config *Config, correlationID string) (*Client, error) {
 	}
 
 	bulk, err := client.BulkProcessor().
-		Workers(config.BulkWorkers).
-		BulkSize(config.MaxBulkSize).
+		//Workers(config.BulkWorkers).
+		//BulkSize(config.MaxBulkSize).
 		After(wrappedClient.afterCallback).
 		Do(context.Background())
 
  5. Use the Rails console to figure out the command used to run the indexer for your project manually:
diff --git a/ee/lib/gitlab/elastic/indexer.rb b/ee/lib/gitlab/elastic/indexer.rb
index bc81119a2c10..a5352aa87c09 100644
--- a/ee/lib/gitlab/elastic/indexer.rb
+++ b/ee/lib/gitlab/elastic/indexer.rb
@@ -101,6 +101,7 @@ def repository
       def run_indexer!(base_sha, to_sha, target)
         vars = build_envvars(target)
         command = build_command(base_sha, to_sha)
+        binding.pry
 
         output, status = Gitlab::Popen.popen(command, nil, vars)
 
Search::Elastic::CommitIndexerWorker.new.perform(24, { 'force' => "true" })
# at the breakpoint, run these
# you will need to add some `"` quotes and possibly escape some others with `\"`
env_vars = vars.map { |k,v| "#{k}=#{v}" }.join(' ')
actual_command = command.join(' ')
  6. Grab the env variables and the command.
  7. Make sure you run the command with your local indexer (not the GDK one).
  8. Check out the main branch, run `make build`, and run the command to index the project. It should fail.
  9. Check out this branch, run `make build`, and run the command to index the project. It should pass.

Before

➜ DEBUG=true ELASTIC_DEBUG=true RAILS_ENV=development ELASTIC_CONNECTION_INFO="{\"url\":[\"http://localhost:9200/\"],\"aws\":false,\"aws_access_key\":\"\",\"aws_region\":\"us-east-1\",\"aws_role_arn\":\"\",\"client_adapter\":\"typhoeus\",\"max_bulk_size_bytes\":4096,\"max_bulk_concurrency\":10,\"index_name\":\"gitlab-development\",\"index_name_commits\":\"gitlab-development-commits\",\"index_name_wikis\":\"gitlab-development-wikis\"}" GITALY_CONNECTION_INFO="{\"storage\":\"default\",\"limit_file_size\":1048576,\"address\":\"unix:/Users/terrichu/Developer/gdk/praefect.socket\",\"token\":null}" CORRELATION_ID= SSL_CERT_FILE=/opt/homebrew/etc/openssl@3/cert.pem SSL_CERT_DIR=/opt/homebrew/etc/openssl@3/certs bin/gitlab-elasticsearch-indexer --timeout=1800s --visibility-level=0 --group-id=98 --project-id=24 --from-sha=4b825dc642cb6eb9a060e54bf8d69288fbee4904 --to-sha=e622b74d2ded298fc6c9f074ab8c93ef54b0bc03 --full-path=test-indexer-bulk-limits/my-exceeding-project --repository-access-level=20 --hashed-root-namespace-id=116 --schema-version-blob=2308 --schema-version-commit=2306 --archived=false --traversal-ids=98- @hashed/c2/35/c2356069e9d1e79ca924378153cfbbfb4d4416b1f99d41a2940bfdb66c5319db.git

INFO[0000] Setting timeout                               timeout=30m0s
DEBU[0000] Indexing from 4b825dc642cb6eb9a060e54bf8d69288fbee4904 to e622b74d2ded298fc6c9f074ab8c93ef54b0bc03  IndexNameCommits=gitlab-development-commits IndexNameDefault=gitlab-development IndexNameWikis=gitlab-development-wikis Permissions="&{0 20}" PermissionsWiki="<nil>" archived=false blobType=blob hashedRootNamespaceId=116 projectID=24 schemaVersionBlob=2308 schemaVersionCommit=2306 schemaVersionWiki=0 skipCommits=false traversalIds=98-
ERRO[0007] Consider lowering maximum bulk request size or/and increasing http.max_content_length  bulkRequestId=1 error="elastic: Error 413 (Request Entity Too Large)" maxBulkSizeSetting=0
FATA[0007] Flushing error                                error="failed to perform all operations"

After

DEBUG=true ELASTIC_DEBUG=true RAILS_ENV=development ELASTIC_CONNECTION_INFO="{\"url\":[\"http://localhost:9200/\"],\"aws\":false,\"aws_access_key\":\"\",\"aws_region\":\"us-east-1\",\"aws_role_arn\":\"\",\"client_adapter\":\"typhoeus\",\"max_bulk_size_bytes\":4096,\"max_bulk_concurrency\":10,\"index_name\":\"gitlab-development\",\"index_name_commits\":\"gitlab-development-commits\",\"index_name_wikis\":\"gitlab-development-wikis\"}" GITALY_CONNECTION_INFO="{\"storage\":\"default\",\"limit_file_size\":1048576,\"address\":\"unix:/Users/terrichu/Developer/gdk/praefect.socket\",\"token\":null}" CORRELATION_ID= SSL_CERT_FILE=/opt/homebrew/etc/openssl@3/cert.pem SSL_CERT_DIR=/opt/homebrew/etc/openssl@3/certs bin/gitlab-elasticsearch-indexer --timeout=1800s --visibility-level=0 --group-id=98 --project-id=24 --from-sha=4b825dc642cb6eb9a060e54bf8d69288fbee4904 --to-sha=e622b74d2ded298fc6c9f074ab8c93ef54b0bc03 --full-path=test-indexer-bulk-limits/my-exceeding-project --repository-access-level=20 --hashed-root-namespace-id=116 --schema-version-blob=2308 --schema-version-commit=2306 --archived=false --traversal-ids=98- @hashed/c2/35/c2356069e9d1e79ca924378153cfbbfb4d4416b1f99d41a2940bfdb66c5319db.git

INFO[0000] Setting timeout                               timeout=30m0s
DEBU[0000] Indexing from 4b825dc642cb6eb9a060e54bf8d69288fbee4904 to e622b74d2ded298fc6c9f074ab8c93ef54b0bc03  IndexNameCommits=gitlab-development-commits IndexNameDefault=gitlab-development IndexNameWikis=gitlab-development-wikis Permissions="&{0 20}" PermissionsWiki="<nil>" archived=false blobType=blob hashedRootNamespaceId=116 projectID=24 schemaVersionBlob=2308 schemaVersionCommit=2306 schemaVersionWiki=0 skipCommits=false traversalIds=98-
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=0 docSize=1982 documentType=blob maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=1982 docSize=1982 documentType=blob maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=1982 docSize=873 documentType=blob maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=873 docSize=679 documentType=commit maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=679 docSize=683 documentType=commit maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=683 docSize=683 documentType=commit maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=683 docSize=683 documentType=commit maxBulkSize=0