Configure max queue size for sending haproxy logs to stackdriver
Context and change summary
By default the "google_cloud" output plugin for fluentd uses a max queue size of 512 MB. This limit is enforced regardless of the chunk size.
Fluentd accumulates log records in a local buffer chunk until either that chunk is 95% full or a configurable amount of time has elapsed since the last flush. If the write call to stackdriver fails, the chunk is added to a local in-memory queue and retried a little later. That queue has a max size (configurable in both bytes and chunks). When the queue reaches that max size, an exception is thrown. Example with walk-through:
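The behavior described above maps onto fluentd's standard `<buffer>` parameters. The following is an illustrative sketch, not our production config; the values shown are placeholders:

```
<match **>
  @type google_cloud
  <buffer>
    @type memory
    chunk_limit_size 3MB             # target size of each buffer chunk
    chunk_full_threshold 0.95        # flush a chunk once it is 95% full...
    flush_interval 5s                # ...or after this long since the last flush
    total_limit_size 512MB           # max size of the backlog queue, in bytes
    queued_chunks_limit_size 175     # max size of the backlog queue, in chunks
    overflow_action throw_exception  # raise when the queue is full (the default)
  </buffer>
</match>
```

When `overflow_action` is left at its default of `throw_exception`, a full queue surfaces as the buffer-overflow exception described here.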
production#5754 (comment 710964990)
That exception also increments an error counter in the Prometheus metrics for the "google_cloud" plugin and its parent "copy" plugin. These errors can trigger an alert, which is currently how we detect that the backlog queue has saturated.
Fluentd does eventually catch up, so for now we can increase the max size of the queue; ideally, though, we would also avoid accumulating a large backlog in the first place.
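Raising the queue limit amounts to bumping the two buffer parameters that bound it. A hypothetical sketch (the numbers below are placeholders, not the values chosen in this MR):

```
<buffer>
  # Both limits apply; whichever is hit first triggers the overflow exception,
  # so they should be raised together.
  total_limit_size 1GB            # was 512MB
  queued_chunks_limit_size 350    # was 175
</buffer>
```

This trades memory headroom for tolerance of longer stackdriver outages before log records are dropped.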
Note that we send identical copies of log records to multiple storage destinations:
- The "google_cloud" plugin sends logs to GCP's Stackdriver Logging API. This is queryable via BigQuery. This MR affects only this route to stackdriver.
- The "cloud_pubsub" plugin sends logs to GCP's PubSub service, publishing the log records as messages on a configured topic. These messages are consumed asynchronously by workers that send them to Elasticsearch to be indexed.
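The fan-out to the two destinations is done by a "copy" output that duplicates each record to both stores. A rough sketch, with plugin options elided and the match pattern and topic name purely illustrative:

```
<match haproxy.**>          # hypothetical tag pattern
  @type copy
  <store>
    @type google_cloud      # -> Stackdriver Logging API (this MR's route)
  </store>
  <store>
    @type cloud_pubsub      # -> Pub/Sub topic, consumed by indexer workers
    topic example-logs      # hypothetical topic name
  </store>
</match>
```

Because each store buffers independently, saturation of the google_cloud queue does not block delivery via the cloud_pubsub route.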
For background, here is a list of findings so far: production#5754 (comment 710965379)
Example queue saturation event
For reference, the following metrics confirm that our google_cloud plugin is currently configured to use:
- max queue size in bytes = 512 MB
- max queue size in chunks = 175 chunks
This is roughly consistent with our current config setting of 3 MB chunk size:
3 MB/chunk * 175 chunks = 525 MB, approximately the 512 MB max queue size