Large values for Elasticsearch maximum bulk request size may cause excessive garbage collection in Sidekiq process
## Summary
When the `Maximum bulk request size` and `Bulk request concurrency` Elasticsearch settings are increased to 100, any Sidekiq process that handles the `cronjob:elastic_index_bulk_cron` and `cronjob:elastic_index_initial_bulk_cron` queues becomes bogged down by these threads performing extremely large amounts of garbage collection. This causes all jobs performed on the impacted processes to have increasingly long `scheduling_duration_s` values, reaching thousands of seconds over time. Pipeline kickoffs and other time-sensitive background jobs were often delayed by 15-30 minutes as a result, and impacted processes were eventually removed by the memory killer.
In addition, db/gitaly/redis `duration_s` values are logged as unusually long when this occurs. These components actually respond quickly, but their responses are not received by the worker threads until the GVL is released, making it appear as if they are taking multiple seconds to complete.
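A toy illustration of this measurement artifact (not GitLab code): a wall-clock timing window around a fast call also absorbs any GC pause that lands inside it, which is how a millisecond-scale external reply can be logged as taking much longer.

```ruby
# Time a block using a monotonic clock, the way duration fields are
# typically measured around an external call.
def timed
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

# The sleep stands in for a fast Redis/Gitaly/DB response.
fast_only = timed { sleep 0.001 }
# Same fast call, but with GC pauses landing inside the timing window.
with_gc   = timed { sleep 0.001; 5.times { GC.start } }
puts format("without GC: %.4fs  with GC in window: %.4fs", fast_only, with_gc)
```

The component responded just as quickly in both cases; only the second measurement includes the pause time.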
There were no resource limitations on the host: this occurred on a dedicated Sidekiq node with 16 cores and 31 GiB of memory. Overall CPU utilization was typically under 10%, although one or more cores were constantly pegged by garbage collection. Memory usage was low, with ~23 GiB available.
The issue was partially mitigated by creating dedicated processes to handle all non-ES queues:

```ruby
sidekiq['queue_groups'] = [
  "feature_category=global_search",
  "feature_category!=global_search",
]
```
This allowed all other jobs to proceed, but the ES jobs were still unhealthy. We fully resolved the problem by returning `Maximum bulk request size` and `Bulk request concurrency` to their default values of 10.
## Steps to reproduce
- Enable Elasticsearch on an instance that contains a large number of issues and merge requests
- Set `Maximum bulk request size` and `Bulk request concurrency` to 100
- `ElasticIndexInitialBulkCronWorker` and `ElasticIndexBulkCronWorker` jobs will create enough garbage that the GC consumes most of the CPU time of the Sidekiq process, blocking other jobs from progressing and eventually leading to scheduling latency in the thousands of seconds
## What is the current bug behavior?
Sidekiq becomes bogged down with GC and cannot process jobs efficiently when large values are set for `Maximum bulk request size`.
## What is the expected correct behavior?

Sidekiq remains healthy regardless of the size of these values.
## Relevant logs and/or screenshots
No consistent errors were logged in `elasticsearch.log`, though there were occasional timeouts:
```json
{
  "severity": "ERROR",
  "time": "2021-02-12T04:36:29.619Z",
  "correlation_id": "d2a7c7b19bdd68d7273a4879ce1af385",
  "message": "bulk_exception",
  "error_class": "Faraday::TimeoutError",
  "error_message": "execution expired"
}
```
In `htop` we observed that impacted Sidekiq processes had one thread running at 100% CPU usage while the others were all idle. Taking a backtrace in `gdb`, we found that all of the threads were waiting on a `futex` except one, which was performing garbage collection:
```
Thread 38 (Thread 0x7f2603bff700 (LWP 4245)):
#0  mark_const_entry_i (value=139802650323360, data=0x7f2666631000) at gc.c:4929
#1  0x00007f266852e97d in rb_id_table_foreach_values (tbl=0x7f26586a36a0, func=0x7f26683cfc50 <mark_const_entry_i>, data=0x7f2666631000) at id_table.c:317
...
```
The remaining threads were either waiting in `monitor_enter (monitor=139802430389360) at monitor.c:80`, which occurs when attempting to log an event, or in `gvl_acquire_common` when performing miscellaneous tasks. Based on this, it appears that garbage collection holds the GVL for extended periods of time, blocking all other threads from progressing.
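GC dominance can also be cross-checked from inside the process with Ruby's built-in `GC::Profiler`, without attaching a debugger. A minimal sketch — the allocation loop below is only a stand-in for the real indexing workload:

```ruby
# GC::Profiler ships with Ruby and reports cumulative time spent in GC
# while enabled.
GC::Profiler.enable
500_000.times { "garbage" + "churn" } # allocate throwaway strings to force GC
gc_seconds = GC::Profiler.total_time  # seconds spent in GC since enable
GC::Profiler.disable
puts format("GC time: %.4fs across %d total GC runs", gc_seconds, GC.count)
```

Sampling this around a suspect job would show GC time growing in step with wall-clock time when the process is pegged.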
Taking an `strace` of Sidekiq, we observed that the active thread was performing a large number of `madvise` calls, as expected. Looking at the larger context of these calls, we found this thread was performing a `cronjob:elastic_index_initial_bulk_cron` job.
Looking at the calls to Redis, we see it request 10,000 items from the `elastic:bulk:initial` key:

```
*8\r\n$13\r\nzrangebyscore\r\n$27\r\nelastic:bulk:initial:0:zset\r\n$4\r\n-inf\r\n$4\r\n+inf\r\n$10\r\nWITHSCORES\r\n$5\r\nLIMIT\r\n$1\r\n0\r\n$5\r\n10000
```
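For readability, that wire payload can be decoded back into the command it encodes. A minimal sketch of a RESP bulk-string array decoder (illustrative only — the real client is redis-rb):

```ruby
# Decode a RESP array of bulk strings ("*<n>" header followed by
# "$<len>"/value pairs) into its argument list.
def decode_resp_array(raw)
  lines = raw.split("\r\n")
  count = Integer(lines.shift.delete_prefix("*"))
  args = []
  until args.length == count
    len_line = lines.shift
    raise "expected bulk string, got #{len_line}" unless len_line.start_with?("$")
    args << lines.shift
  end
  args
end

raw = "*8\r\n$13\r\nzrangebyscore\r\n$27\r\nelastic:bulk:initial:0:zset\r\n" \
      "$4\r\n-inf\r\n$4\r\n+inf\r\n$10\r\nWITHSCORES\r\n$5\r\nLIMIT\r\n" \
      "$1\r\n0\r\n$5\r\n10000"
puts decode_resp_array(raw).join(" ")
# => zrangebyscore elastic:bulk:initial:0:zset -inf +inf WITHSCORES LIMIT 0 10000
```

That is, a `ZRANGEBYSCORE` over the entire score range with a limit of 10,000 entries.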
This job was still running 17 minutes later when the trace ended, having read 129 MiB from Postgres and written 9 MiB to ES. It performed 20,682 `madvise` calls during this time, spending roughly 85% of its time in userland performing GC or checking the time, which happens very frequently during GC:
```
PID 18511
15120562 syscalls, active time: 256768.312ms, user time: 634376.875ms, total time: 1023154.625ms
start time: 19:32:56.845377  end time: 19:49:59.999988

syscall              count     total (ms)   max (ms)   avg (ms)   min (ms)  errors
-----------------  --------   ----------  ---------  ---------  ---------  --------
clock_gettime      14998749   253977.031     53.178      0.017      0.006
futex                 18754   131512.750   7603.952      7.013      0.006  EAGAIN: 2961 ERESTARTSYS: 3 ERESTART_RESTARTBLOCK: 1 ETIMEDOUT: 58
madvise               20682     1090.404      1.955      0.053      0.009
```
## Possible fixes
The docs make it sound as if `Maximum bulk request size` and `Bulk request concurrency` are only intended to affect the `gitlab-elasticsearch-indexer` process, not the jobs run within Sidekiq itself. Perhaps the jobs performed by Sidekiq itself should not be governed by them?
If we do intend these settings to apply to `cronjob:elastic_index_bulk_cron` and `cronjob:elastic_index_initial_bulk_cron`, then we should prevent values this large from being used.
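One cheap guard would be to clamp the requested values before the cron workers consume them. A hypothetical sketch — the constant names and helper below are illustrative, not GitLab's actual settings code:

```ruby
# Illustrative ceilings matching the current defaults, which are known
# to keep Sidekiq healthy.
MAX_SAFE_BULK_SIZE_MB     = 10
MAX_SAFE_BULK_CONCURRENCY = 10

# Clamp user-requested bulk settings to the safe ceilings before the
# Sidekiq cron workers pick them up.
def safe_bulk_settings(requested_size_mb, requested_concurrency)
  {
    size_mb:     [requested_size_mb, MAX_SAFE_BULK_SIZE_MB].min,
    concurrency: [requested_concurrency, MAX_SAFE_BULK_CONCURRENCY].min
  }
end

puts safe_bulk_settings(100, 100).inspect
```

A validation on the settings model that rejects out-of-range values outright would work equally well; the point is that the Sidekiq-side workers should never see a value of 100.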