fix: prevent overlimit errors with bulk size tracking
Relates to issue #179 (closed)
This MR was created by Duo in Session 1217497
I believe this fixes a gap in the olivere library's bulk processor checks: the library calculates indexed document size without including the overhead of the bulk operation metadata, so a request can exceed the limit even when the bulk processor's own size check passes.
AI Summary
This merge request adds smart bulk size management to the GitLab Elasticsearch indexer to prevent "Request Entity Too Large" errors that commonly occur with AWS OpenSearch services.
The main changes include:
New Features:
- Automatic size tracking: The system now monitors how much data is being prepared for each batch request to Elasticsearch
- Proactive flushing: When adding a new document would make the batch too large, it automatically sends the current batch first, then starts a new one
- Debug mode: Added an environment variable `ELASTIC_DEBUG` to help developers see what requests are being sent to Elasticsearch
Technical Improvements:
- Thread safety: Added proper locking to ensure the size tracking works correctly when multiple operations happen simultaneously
- Size calculation: The system estimates document sizes by converting them to JSON and accounting for Elasticsearch metadata overhead
- Configuration: Added a `max_bulk_size_bytes` setting with a default of 10MB, but AWS users should set it to 9MB to account for request overhead
Benefits:
- Prevents failed requests due to size limits, especially important for AWS OpenSearch which has strict 10MB limits
- Reduces retry overhead and improves performance
- Works automatically without requiring configuration changes
- Includes comprehensive tests to ensure reliability
The changes are backward compatible and work transparently with existing bulk processing settings.
How to test this
Tested in gitlab-org/gitlab project: gitlab!211127 (closed)
Testing this was SUCH A PAIN.
- edit the Elasticsearch config to lower the max content length to 4096 bytes, in `GDK_DIR/elasticsearch/config/elasticsearch.yml`:

  ```yaml
  # ======================== Elasticsearch Configuration =========================
  http.max_content_length: 4096b
  ```

- verify the setting:

  ```shell
  curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.http.max_content_length" | jq
  ```

- create a project with a few files, these are the sizes I used:

  ```
  ~/Developer/scratch/indexer_bulk_limit_tests/my-exceeding-project on 🚀 main on ☁️ ➜ ls -la
  total 24
  drwxr-xr-x@  6 terrichu  staff   192 Nov  3 15:45 .
  drwxr-xr-x@  8 terrichu  staff   256 Nov  3 15:22 ..
  drwxr-xr-x@ 12 terrichu  staff   384 Nov  3 15:46 .git
  -rw-r--r--@  1 terrichu  staff  1300 Nov  3 15:44 file_1300_bytes_1.txt
  -rw-r--r--@  1 terrichu  staff  1300 Nov  3 15:44 file_1300_bytes_2.txt
  -rw-r--r--@  1 terrichu  staff   200 Nov  3 15:44 file_200_bytes.txt
  ```

- apply this patch. I had to disable the workers and bulk size settings for the ES client to get it to actually go over the limit. I think this is an edge case that needs more files to be seen in production.
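If it helps, test files matching the sizes above can be generated before applying the patch; this is just one way to do it (the content is arbitrary, only the byte count matters):

```shell
# generate text files of exact sizes: one 'a' per argument from seq
printf 'a%.0s' $(seq 1300) > file_1300_bytes_1.txt
printf 'a%.0s' $(seq 1300) > file_1300_bytes_2.txt
printf 'a%.0s' $(seq 200)  > file_200_bytes.txt
wc -c file_*.txt
```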
  ```diff
  diff --git a/internal/mode/advanced/elastic/client.go b/internal/mode/advanced/elastic/client.go
  index 8c22a7f..ae2910c 100644
  --- a/internal/mode/advanced/elastic/client.go
  +++ b/internal/mode/advanced/elastic/client.go
  @@ -180,7 +180,6 @@ func NewClient(config *Config, correlationID string) (*Client, error) {
   		GroupID:               config.GroupID,
   		Permissions:           config.Permissions,
   		PermissionsWiki:       config.PermissionsWiki,
  -		maxBulkSize:           config.MaxBulkSize,
   		traversalIDs:          config.TraversalIDs,
   		Client:                client,
   		hashedRootNamespaceId: config.HashedRootNamespaceId,
  @@ -191,8 +190,8 @@ func NewClient(config *Config, correlationID string) (*Client, error) {
   	}

   	bulk, err := client.BulkProcessor().
  -		Workers(config.BulkWorkers).
  -		BulkSize(config.MaxBulkSize).
  +		//Workers(config.BulkWorkers).
  +		//BulkSize(config.MaxBulkSize).
   		After(wrappedClient.afterCallback).
   		Do(context.Background())
  ```
- Use the rails console to figure out the command used to run the indexer for your project manually. Apply this patch to add a breakpoint:

  ```diff
  diff --git a/ee/lib/gitlab/elastic/indexer.rb b/ee/lib/gitlab/elastic/indexer.rb
  index bc81119a2c10..a5352aa87c09 100644
  --- a/ee/lib/gitlab/elastic/indexer.rb
  +++ b/ee/lib/gitlab/elastic/indexer.rb
  @@ -101,6 +101,7 @@ def repository
     def run_indexer!(base_sha, to_sha, target)
       vars = build_envvars(target)
       command = build_command(base_sha, to_sha)
  +    binding.pry
       output, status = Gitlab::Popen.popen(command, nil, vars)
  ```

  Then trigger indexing from the rails console:

  ```ruby
  Search::Elastic::CommitIndexerWorker.new.perform(24, { 'force' => "true" })
  ```

  At the breakpoint, run these (you will need to add some `"` and possibly escape `\"` some others):

  ```ruby
  env_vars = vars.map { |k,v| "#{k}=#{v}" }.join(' ')
  actual_command = command.join(' ')
  ```
- grab the env variables + the command
- make sure you run the command for your local indexer (not the GDK one)
- check out the `main` branch, run `make build`, then run the command for the project indexing. It should fail
- check out this branch, run `make build`, then run the command for the project indexing. It should pass
Before
```
➜ DEBUG=true ELASTIC_DEBUG=true RAILS_ENV=development ELASTIC_CONNECTION_INFO="{\"url\":[\"http://localhost:9200/\"],\"aws\":false,\"aws_access_key\":\"\",\"aws_region\":\"us-east-1\",\"aws_role_arn\":\"\",\"client_adapter\":\"typhoeus\",\"max_bulk_size_bytes\":4096,\"max_bulk_concurrency\":10,\"index_name\":\"gitlab-development\",\"index_name_commits\":\"gitlab-development-commits\",\"index_name_wikis\":\"gitlab-development-wikis\"}" GITALY_CONNECTION_INFO="{\"storage\":\"default\",\"limit_file_size\":1048576,\"address\":\"unix:/Users/terrichu/Developer/gdk/praefect.socket\",\"token\":null}" CORRELATION_ID= SSL_CERT_FILE=/opt/homebrew/etc/openssl@3/cert.pem SSL_CERT_DIR=/opt/homebrew/etc/openssl@3/certs bin/gitlab-elasticsearch-indexer --timeout=1800s --visibility-level=0 --group-id=98 --project-id=24 --from-sha=4b825dc642cb6eb9a060e54bf8d69288fbee4904 --to-sha=e622b74d2ded298fc6c9f074ab8c93ef54b0bc03 --full-path=test-indexer-bulk-limits/my-exceeding-project --repository-access-level=20 --hashed-root-namespace-id=116 --schema-version-blob=2308 --schema-version-commit=2306 --archived=false --traversal-ids=98- @hashed/c2/35/c2356069e9d1e79ca924378153cfbbfb4d4416b1f99d41a2940bfdb66c5319db.git
INFO[0000] Setting timeout  timeout=30m0s
DEBU[0000] Indexing from 4b825dc642cb6eb9a060e54bf8d69288fbee4904 to e622b74d2ded298fc6c9f074ab8c93ef54b0bc03  IndexNameCommits=gitlab-development-commits IndexNameDefault=gitlab-development IndexNameWikis=gitlab-development-wikis Permissions="&{0 20}" PermissionsWiki="<nil>" archived=false blobType=blob hashedRootNamespaceId=116 projectID=24 schemaVersionBlob=2308 schemaVersionCommit=2306 schemaVersionWiki=0 skipCommits=false traversalIds=98-
ERRO[0007] Consider lowering maximum bulk request size or/and increasing http.max_content_length  bulkRequestId=1 error="elastic: Error 413 (Request Entity Too Large)" maxBulkSizeSetting=0
FATA[0007] Flushing error  error="failed to perform all operations"
```
After
```
DEBUG=true ELASTIC_DEBUG=true RAILS_ENV=development ELASTIC_CONNECTION_INFO="{\"url\":[\"http://localhost:9200/\"],\"aws\":false,\"aws_access_key\":\"\",\"aws_region\":\"us-east-1\",\"aws_role_arn\":\"\",\"client_adapter\":\"typhoeus\",\"max_bulk_size_bytes\":4096,\"max_bulk_concurrency\":10,\"index_name\":\"gitlab-development\",\"index_name_commits\":\"gitlab-development-commits\",\"index_name_wikis\":\"gitlab-development-wikis\"}" GITALY_CONNECTION_INFO="{\"storage\":\"default\",\"limit_file_size\":1048576,\"address\":\"unix:/Users/terrichu/Developer/gdk/praefect.socket\",\"token\":null}" CORRELATION_ID= SSL_CERT_FILE=/opt/homebrew/etc/openssl@3/cert.pem SSL_CERT_DIR=/opt/homebrew/etc/openssl@3/certs bin/gitlab-elasticsearch-indexer --timeout=1800s --visibility-level=0 --group-id=98 --project-id=24 --from-sha=4b825dc642cb6eb9a060e54bf8d69288fbee4904 --to-sha=e622b74d2ded298fc6c9f074ab8c93ef54b0bc03 --full-path=test-indexer-bulk-limits/my-exceeding-project --repository-access-level=20 --hashed-root-namespace-id=116 --schema-version-blob=2308 --schema-version-commit=2306 --archived=false --traversal-ids=98- @hashed/c2/35/c2356069e9d1e79ca924378153cfbbfb4d4416b1f99d41a2940bfdb66c5319db.git
INFO[0000] Setting timeout  timeout=30m0s
DEBU[0000] Indexing from 4b825dc642cb6eb9a060e54bf8d69288fbee4904 to e622b74d2ded298fc6c9f074ab8c93ef54b0bc03  IndexNameCommits=gitlab-development-commits IndexNameDefault=gitlab-development IndexNameWikis=gitlab-development-wikis Permissions="&{0 20}" PermissionsWiki="<nil>" archived=false blobType=blob hashedRootNamespaceId=116 projectID=24 schemaVersionBlob=2308 schemaVersionCommit=2306 schemaVersionWiki=0 skipCommits=false traversalIds=98-
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=0 docSize=1982 documentType=blob maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=1982 docSize=1982 documentType=blob maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=1982 docSize=873 documentType=blob maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=873 docSize=679 documentType=commit maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=679 docSize=683 documentType=commit maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=683 docSize=683 documentType=commit maxBulkSize=0
DEBU[0000] Flushing bulk processor - would exceed max size  currentBatchSize=683 docSize=683 documentType=commit maxBulkSize=0
```