# Fluentd single-threaded bottleneck
## Background
Fluentd (also known as td-agent) is the log collector we use to forward logs into GCP Pub/Sub. We run it on VMs as well as on Kubernetes as a DaemonSet.
Fluentd is written in Ruby and has a largely single-threaded design. There is a multi-process plugin, but it is rather clunky to configure, as it requires explicitly pinning each part of the configuration to a specific worker.
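For a sense of that overhead, here is a minimal sketch using Fluentd's built-in multi-process workers (the current incarnation of this feature); the paths, tags, and worker assignments are placeholders, not our real config:

```
<system>
  # Run two worker processes. Plugins that are not multi-worker-ready
  # (such as in_tail) must be pinned to a specific worker by hand.
  workers 2
</system>

<worker 0>
  <source>
    @type tail
    path /var/log/haproxy.log                  # placeholder path
    pos_file /var/lib/fluentd/haproxy.pos
    tag haproxy
    <parse>
      @type none
    </parse>
  </source>
  # ... output config for this worker ...
</worker>

<worker 1>
  <source>
    @type tail
    path /var/log/gitlab/production_json.log   # placeholder path
    pos_file /var/lib/fluentd/rails.pos
    tag rails
    <parse>
      @type none
    </parse>
  </source>
  # ... output config for this worker ...
</worker>
```

Nothing balances sources across workers automatically, which is what makes this approach tedious to maintain as log streams are added.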
## Problem
We are bumping into the scaling limits of Fluentd on two sets of hosts:
- The HAProxy fleet -- Fluentd usually saturates on CPU before HAProxy does. This only happens when HAProxy is itself close to its limit, so HAProxy would likely saturate shortly afterwards anyway. We currently address this by scaling the fleet out horizontally.
- The Kubernetes fleet, specifically the API workload.
When this happens, we cannot keep up with forwarding the log volume to Pub/Sub (and onwards to Elasticsearch), leading to delayed or lost logs.
While this does not directly impact users, it degrades our monitoring capabilities for gitlab.com, which has severe knock-on effects for availability and other parts of the business.
## Evidence
We can see occasional CPU saturation on a subset of fluentd-elasticsearch pods.
## Solution
There have been several proposals for addressing this:
- Optimize the workload (reduce the per-host log volume, e.g. by using smaller machines, though this hurts cost efficiency)
- Optimize the existing Fluentd setup (e.g. multi-process workers, as sketched in the Background section above)
- Switch from Fluentd to a more efficient and scalable log aggregator (preliminary analysis)
I'm opening this issue to discuss whether this is a problem we should solve, and if so, how.
## Proposal
Option 3 is the most sustainable way of addressing this, as it moves us away from Fluentd's inherent scaling limitations.
We should evaluate it with a proof of concept. Consensus appears to be forming around Vector.
Requirements:
- Port the existing Fluentd configuration for the high-volume log streams. Primary candidates are rails and haproxy.
- Eventually deploy to both the VM fleet and Kubernetes (the PoC can be Kubernetes-only).
- Ingest from log files, transform log lines, and publish to GCP Pub/Sub (sketched below).
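To make that last requirement concrete, a minimal Vector pipeline might look like the sketch below. This is illustrative only: the file path, the transform, and the project/topic names are assumptions, not a worked-out config.

```toml
# Tail the rails log files (placeholder path).
[sources.rails_logs]
type = "file"
include = ["/var/log/gitlab/production_json.log"]

# Parse each line as JSON, assuming the stream is JSON-formatted.
[transforms.parse_rails]
type = "remap"
inputs = ["rails_logs"]
source = '''
. = parse_json!(string!(.message))
'''

# Publish to GCP Pub/Sub (placeholder project and topic).
[sinks.pubsub]
type = "gcp_pubsub"
inputs = ["parse_rails"]
project = "example-gcp-project"
topic = "example-logs-topic"

[sinks.pubsub.encoding]
codec = "json"
```

Vector ships a native `gcp_pubsub` sink, so no custom output plugin is needed; its `internal_metrics` source should also give us a head start on the meta-monitoring work mentioned in the estimate below.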
Out of scope:
- Fluentd archiver (it also runs on Fluentd and may benefit from this work too, but it is not in the critical path for production observability).
Estimate:
- If we limit the scope to rails on Kubernetes, I optimistically estimate a pre-production PoC to take two weeks.
- Going to production will take longer, as we need to solve meta-monitoring (metrics and logs for the service itself), figure out rollout logistics, and complete a production readiness review.
- Expanding the scope to the VM fleets (haproxy, postgres, redis, gitaly) and to more log streams will also add time.