Tune logging stack to handle log bursts

Summary

The current logging parameters don't perform well under burst conditions, causing OOM kills in both fluentd and root-fluentd. The following problems have been observed in past issues and stress-test pipelines:

  1. Oversized chunks rejected by Loki gateway
    • Chunks of ~30 MB and above are being sent (the gateway log below shows a ~32 MB body), while the Loki gateway is configured with client_max_body_size = 30MB.
    • Requests over that limit are rejected with HTTP 413 Request Entity Too Large errors (see the gateway sketch after this list):
<html>
<head><title>413 Request Entity Too Large</title></head>
<body>
<center><h1>413 Request Entity Too Large</h1></center>
<hr><center>nginx/1.27.5</center>
</body>
</html>

Corresponding error in the Loki gateway (nginx) log, rejecting a ~32 MB body:

2025/08/21 09:33:34 [error] 9#9: *5795 client intended to send too large body: 32275213 bytes, client: 100.72.23.30, server: , request: "POST /loki/api/v1/push HTTP/1.1", host: "loki-gateway.loki.svc.cluster.local"
  2. Per-stream rate limiting exceeded
    • The Loki write configuration enforces per_stream_rate_limit = 3MB/s.
    • During stress tests, individual streams exceeded this limit, triggering ingestion failures with "write operation failed ... Per stream rate limit exceeded (limit: 3MB/sec)" (see the limits_config sketch after this list):
level=error ts=2025-08-16T06:27:08.949808267Z caller=manager.go:50 component=ingester path=write msg="write operation failed" details="Per stream rate limit exceeded (limit: 3MB/sec) while attempting to ingest for stream '{app=\"log-spammer\", cluster=\"management-cluster\", container=\"spam\", host=\"management-cluster-cp-d780ce53ac-dt7m7\", namespace=\"log-stress\", pod=\"log-spammer-6b68bf58f6-w5xvj\", service_name=\"log-spammer\"}'"
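
A possible mitigation for the 413s is to raise the request-body limit on the Loki gateway so it sits above the largest chunk fluentd is allowed to emit. This is a sketch of the relevant nginx directive only; where it gets templated (e.g. the gateway section of the Loki Helm values) depends on the chart version, and 64m is an illustrative value:

# Loki gateway nginx config -- client_max_body_size can be set at http, server, or location level.
# 64m is illustrative; keep it above the fluentd chunk_limit_size (see the buffer sketch further down).
client_max_body_size 64m;

The complementary fix is to cap fluentd's chunk size below this limit, shown in the buffer sketch after the summary paragraph.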
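For the per-stream rate limit, the knobs live under Loki's limits_config. A sketch with illustrative values (the option names are upstream Loki settings; in the grafana/loki Helm chart they are typically set under loki.limits_config in values.yaml):

limits_config:
  # Default is 3MB/s per stream; raise the steady-state and burst limits to absorb spikes.
  per_stream_rate_limit: 8MB
  per_stream_rate_limit_burst: 16MB
  # Tenant-wide ingestion limits may also need headroom once streams can push more.
  ingestion_rate_mb: 32
  ingestion_burst_size_mb: 64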

The oversized chunks add to fluentd's memory consumption, while the per-stream rate limit slows down chunk flushing, which can fill up the buffer PVC. Both of these can end up restarting fluentd.
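
A sketch of the corresponding fluentd output-buffer tuning, assuming the fluent-plugin-grafana-loki output with a file buffer on the PVC; the plugin type, URL, and all sizes are illustrative and need to be sized against the actual PVC and gateway limits:

<match **>
  @type loki
  url http://loki-gateway.loki.svc.cluster.local
  <buffer>
    @type file
    path /var/log/fluentd-buffers/loki.buffer
    # Keep each chunk well below the gateway's client_max_body_size to avoid 413s.
    chunk_limit_size 8m
    # Cap total buffered data below the PVC size so bursts turn into retries, not a full disk.
    total_limit_size 8g
    # Flush more aggressively so chunks drain during bursts instead of piling up.
    flush_mode interval
    flush_interval 5s
    flush_thread_count 4
    # Prefer dropping the oldest chunks over blocking or crashing fluentd when the buffer is full.
    overflow_action drop_oldest_chunk
    retry_max_interval 30s
  </buffer>
</match>

With settings along these lines the failure mode under a burst becomes back-pressure and, in the worst case, dropped chunks rather than OOM kills or a full buffer volume.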

