Configure max queue size for sending haproxy logs to stackdriver
Context and change summary
By default the "google_cloud" output plugin for fluentd uses a max queue size of 512 MB. This limit is enforced regardless of the chunk size.
Fluentd accumulates log records in a local buffer chunk until either that chunk is 95% full or a configurable amount of time has elapsed since the last flush. If the write call to stackdriver fails, the chunk is added to a local in-memory queue and retried a little later. That queue has a max size (configurable in both bytes and chunks). When the queue reaches that max size, an exception is thrown. Example with walk-through:
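The behavior described above maps onto fluentd's standard `<buffer>` parameters. The following is an illustrative sketch, not our production config; the values shown are placeholders:

```
<match **>
  @type google_cloud
  <buffer>
    @type memory
    chunk_limit_size 3MB             # target size of each buffer chunk
    chunk_full_threshold 0.95        # flush a chunk once it is 95% full...
    flush_interval 5s                # ...or after this long since the last flush
    total_limit_size 512MB           # max size of the backlog queue, in bytes
    queued_chunks_limit_size 175     # max size of the backlog queue, in chunks
    overflow_action throw_exception  # raise when the queue is full (the default)
  </buffer>
</match>
```

When `overflow_action` is left at its default of `throw_exception`, a full queue surfaces as the buffer-overflow exception described here.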
production#5754 (comment 710964990)
That exception also increments an error counter in the Prometheus metrics for the "google_cloud" plugin and its parent "copy" plugin. These errors can trigger an alert, which is currently how we detect that the backlog queue has saturated.
Fluentd does eventually catch up, so for now we can increase the max size of the queue; ideally, though, we would also avoid accumulating a large backlog in the first place.
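Raising the queue limit amounts to bumping the two buffer parameters that bound it. A hypothetical sketch (the numbers below are placeholders, not the values chosen in this MR):

```
<buffer>
  # Both limits apply; whichever is hit first triggers the overflow exception,
  # so they should be raised together.
  total_limit_size 1GB            # was 512MB
  queued_chunks_limit_size 350    # was 175
</buffer>
```

This trades memory headroom for tolerance of longer stackdriver outages before log records are dropped.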
Note that we send identical copies of log records to multiple storage destinations:
- The "google_cloud" plugin sends logs to GCP's Stackdriver Logging API. This is queryable via BigQuery. This MR affects only this route to stackdriver.
- The "cloud_pubsub" plugin sends logs to GCP's PubSub service, publishing the log records as messages on a configured topic. These messages are consumed asynchronously by workers that send them to Elasticsearch to be indexed.
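The fan-out to the two destinations is done by a "copy" output that duplicates each record to both stores. A rough sketch, with plugin options elided and the match pattern and topic name purely illustrative:

```
<match haproxy.**>          # hypothetical tag pattern
  @type copy
  <store>
    @type google_cloud      # -> Stackdriver Logging API (this MR's route)
  </store>
  <store>
    @type cloud_pubsub      # -> Pub/Sub topic, consumed by indexer workers
    topic example-logs      # hypothetical topic name
  </store>
</match>
```

Because each store buffers independently, saturation of the google_cloud queue does not block delivery via the cloud_pubsub route.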
For background, here is a list of findings so far: production#5754 (comment 710965379)
Example queue saturation event
For reference, the following metrics confirm that our google_cloud plugin is currently configured to use:
- max queue size in bytes = 512 MB
- max queue size in chunks = 175 chunks
This is roughly consistent with our current config setting of 3 MB chunk size:
3 MB/chunk * 175 chunks = 525 MB, approximately the 512 MB max queue size