td-agent unable to keep up with logs on haproxy nodes
We've had several incidents recently regarding fluentd errors (production#4951 (closed), production#5007 (closed)) that seem to be related to saturation of the buffer as seen in the log below:
2021-06-08 15:51:44 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:throw_exception
2021-06-08 15:51:44 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/buffer.rb:293:in `write'" tag="haproxy"
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/buffer.rb:293:in `write'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/compat/output.rb:374:in `block in handle_stream_si
mple'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/output.rb:888:in `write_guard'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/compat/output.rb:373:in `handle_stream_simple'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/output.rb:878:in `execute_chunking'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/output.rb:808:in `emit_buffered'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/out_copy.rb:61:in `block in process'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/out_copy.rb:59:in `each'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/out_copy.rb:59:in `each_with_index'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/out_copy.rb:59:in `process'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/multi_output.rb:148:in `emit_sync'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/event_router.rb:160:in `emit_events'
2021-06-08 15:51:48 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/event_router.rb:97:in `emit_stream'
During investigation @stanhu came up with three ideas on how to fix it (production#4951 (comment 611766956)):
- See if we can optimize anything in this call flow.
- Use multi-process workers: https://docs.fluentd.org/deployment/multi-process-workers
- Switch to a Go-based (e.g. Filebeat) or C-based (e.g. fluent-bit) log aggregator