Large values for Elasticsearch maximum bulk request size may cause excessive garbage collection in Sidekiq process
## Summary
When the `Maximum bulk request size` and `Bulk request concurrency` Elasticsearch settings are increased to 100, any Sidekiq process that handles the `cronjob:elastic_index_bulk_cron` and `cronjob:elastic_index_initial_bulk_cron` queues becomes bogged down by these threads performing extremely large amounts of garbage collection. This causes all jobs performed on the impacted processes to have increasingly long `scheduling_duration_s` values, reaching thousands of seconds over time. Pipeline kickoffs and other time-sensitive background jobs were often delayed by 15-30 minutes as a result, and impacted processes were eventually removed by the memory killer.
In addition, db/gitaly/redis `duration_s` values are logged as unusually long when this occurs. These components actually respond quickly, but their responses are not received by the worker threads until the GVL is released, making it appear as if they are taking multiple seconds to complete.
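A toy illustration of this measurement artifact (not GitLab code): a wall-clock timing window around a fast call also absorbs any GC pause that lands inside it, which is how a millisecond-scale external reply can be logged as taking much longer.

```ruby
# Time a block using a monotonic clock, the way duration fields are
# typically measured around an external call.
def timed
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

# The sleep stands in for a fast Redis/Gitaly/DB response.
fast_only = timed { sleep 0.001 }
# Same fast call, but with GC pauses landing inside the timing window.
with_gc   = timed { sleep 0.001; 5.times { GC.start } }
puts format("without GC: %.4fs  with GC in window: %.4fs", fast_only, with_gc)
```

The component responded just as quickly in both cases; only the second measurement includes the pause time.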
There were no resource limitations on the host: this occurred on a dedicated Sidekiq node with 16 cores and 31 GiB of memory. Overall CPU utilization was typically under 10%, although one or more cores were constantly pegged by garbage collection. Memory usage was low, with ~23 GiB available.
The issue was partially mitigated by creating dedicated processes to handle all non-ES queues:

```ruby
sidekiq['queue_groups'] = [
  "feature_category=global_search",
  "feature_category!=global_search",
]
```
This allowed all other jobs to proceed, but the ES jobs were still unhealthy. We fully resolved the problem by returning `Maximum bulk request size` and `Bulk request concurrency` to their default values of 10.
## Steps to reproduce
- Enable Elasticsearch on an instance that contains a large number of issues and merge requests
- Set `Maximum bulk request size` and `Bulk request concurrency` to 100
- `ElasticIndexInitialBulkCronWorker` and `ElasticIndexBulkCronWorker` jobs will create enough garbage that the GC consumes most of the CPU time of the Sidekiq process, blocking other jobs from progressing and eventually leading to scheduling latency in the thousands of seconds
## What is the current bug behavior?
Sidekiq becomes bogged down with GC and cannot process jobs efficiently when large values are set for `Maximum bulk request size`.
## What is the expected correct behavior?

Sidekiq remains healthy regardless of the size of these values.
## Relevant logs and/or screenshots
No consistent errors were logged in `elasticsearch.log`, though there were occasional timeouts:
```json
{
  "severity": "ERROR",
  "time": "2021-02-12T04:36:29.619Z",
  "correlation_id": "d2a7c7b19bdd68d7273a4879ce1af385",
  "message": "bulk_exception",
  "error_class": "Faraday::TimeoutError",
  "error_message": "execution expired"
}
```
In `htop` we observed that impacted Sidekiq processes had one thread running at 100% CPU usage while the others were all idle. Taking a backtrace in `gdb`, we found that all of the threads were waiting on a `futex` except one, which was performing garbage collection:
```
Thread 38 (Thread 0x7f2603bff700 (LWP 4245)):
#0  mark_const_entry_i (value=139802650323360, data=0x7f2666631000) at gc.c:4929
#1  0x00007f266852e97d in rb_id_table_foreach_values (tbl=0x7f26586a36a0, func=0x7f26683cfc50 <mark_const_entry_i>, data=0x7f2666631000) at id_table.c:317
...
```
The remaining threads were either waiting in `monitor_enter (monitor=139802430389360) at monitor.c:80`, which occurs when attempting to log an event, or in `gvl_acquire_common` when performing miscellaneous tasks. Based on this, it appears that garbage collection holds the GVL for extended periods of time, blocking all other threads from progressing.
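GC dominance can also be cross-checked from inside the process with Ruby's built-in `GC::Profiler`, without attaching a debugger. A minimal sketch — the allocation loop below is only a stand-in for the real indexing workload:

```ruby
# GC::Profiler ships with Ruby and reports cumulative time spent in GC
# while enabled.
GC::Profiler.enable
500_000.times { "garbage" + "churn" } # allocate throwaway strings to force GC
gc_seconds = GC::Profiler.total_time  # seconds spent in GC since enable
GC::Profiler.disable
puts format("GC time: %.4fs across %d total GC runs", gc_seconds, GC.count)
```

Sampling this around a suspect job would show GC time growing in step with wall-clock time when the process is pegged.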
Taking an `strace` of Sidekiq, we observed that the active thread was performing a large number of `madvise` calls, as expected. Looking at the larger context of these calls, we found this thread was performing a `cronjob:elastic_index_initial_bulk_cron` job.
Looking at the calls to Redis, we see it request 10,000 items from the `elastic:bulk:initial` key:

```
*8\r\n$13\r\nzrangebyscore\r\n$27\r\nelastic:bulk:initial:0:zset\r\n$4\r\n-inf\r\n$4\r\n+inf\r\n$10\r\nWITHSCORES\r\n$5\r\nLIMIT\r\n$1\r\n0\r\n$5\r\n10000
```
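For readability, that wire payload can be decoded back into the command it encodes. A minimal sketch of a RESP bulk-string array decoder (illustrative only — the real client is redis-rb):

```ruby
# Decode a RESP array of bulk strings ("*<n>" header followed by
# "$<len>"/value pairs) into its argument list.
def decode_resp_array(raw)
  lines = raw.split("\r\n")
  count = Integer(lines.shift.delete_prefix("*"))
  args = []
  until args.length == count
    len_line = lines.shift
    raise "expected bulk string, got #{len_line}" unless len_line.start_with?("$")
    args << lines.shift
  end
  args
end

raw = "*8\r\n$13\r\nzrangebyscore\r\n$27\r\nelastic:bulk:initial:0:zset\r\n" \
      "$4\r\n-inf\r\n$4\r\n+inf\r\n$10\r\nWITHSCORES\r\n$5\r\nLIMIT\r\n" \
      "$1\r\n0\r\n$5\r\n10000"
puts decode_resp_array(raw).join(" ")
# => zrangebyscore elastic:bulk:initial:0:zset -inf +inf WITHSCORES LIMIT 0 10000
```

That is, a `ZRANGEBYSCORE` over the entire score range with a limit of 10,000 entries.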
This job was still running 17 minutes later when the trace ended, having read 129 MiB from Postgres and written 9 MiB to ES. It performed 20,682 `madvise` calls during this time, spending roughly 85% of its time in userland performing GC or checking the time, which happens very frequently during GC:
```
PID 18511
15120562 syscalls, active time: 256768.312ms, user time: 634376.875ms, total time: 1023154.625ms
start time: 19:32:56.845377  end time: 19:49:59.999988

syscall              count     total (ms)   max (ms)   avg (ms)   min (ms)  errors
-----------------  --------   ----------  ---------  ---------  ---------  --------
clock_gettime      14998749   253977.031     53.178      0.017      0.006
futex                 18754   131512.750   7603.952      7.013      0.006  EAGAIN: 2961 ERESTARTSYS: 3 ERESTART_RESTARTBLOCK: 1 ETIMEDOUT: 58
madvise               20682     1090.404      1.955      0.053      0.009
```
## Possible fixes
The docs make it sound as if `Maximum bulk request size` and `Bulk request concurrency` are only intended to affect the `gitlab-elasticsearch-indexer` process, not the jobs run within Sidekiq itself. Perhaps the jobs performed by Sidekiq itself should not be governed by them?
If we do intend these settings to apply to `cronjob:elastic_index_bulk_cron` and `cronjob:elastic_index_initial_bulk_cron`, then we should prevent values this large from being used.
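One cheap guard would be to clamp the requested values before the cron workers consume them. A hypothetical sketch — the constant names and helper below are illustrative, not GitLab's actual settings code:

```ruby
# Illustrative ceilings matching the current defaults, which are known
# to keep Sidekiq healthy.
MAX_SAFE_BULK_SIZE_MB     = 10
MAX_SAFE_BULK_CONCURRENCY = 10

# Clamp user-requested bulk settings to the safe ceilings before the
# Sidekiq cron workers pick them up.
def safe_bulk_settings(requested_size_mb, requested_concurrency)
  {
    size_mb:     [requested_size_mb, MAX_SAFE_BULK_SIZE_MB].min,
    concurrency: [requested_concurrency, MAX_SAFE_BULK_CONCURRENCY].min
  }
end

puts safe_bulk_settings(100, 100).inspect
```

A validation on the settings model that rejects out-of-range values outright would work equally well; the point is that the Sidekiq-side workers should never see a value of 100.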