Missing logs from GKE: fluentd upgrade recommended

Summary

In production#5657 (comment 705786157) we found that we are continuously losing some logs from GKE. fluentd drops them because it does not re-open some log files after rotation. https://github.com/fluent/fluentd/pull/3294, which was released in 1.12.1, looks like a plausible fix; in particular it is linked to https://github.com/fluent/fluentd/issues/3239, which sounds very much like our scenario.
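To make the failure mode concrete, here is a minimal Python sketch of what a tailer has to do to survive rotation: keep reading the open handle, and re-open the path once it points at a different file. This is not fluentd's implementation (fluentd's tailing is written in Ruby); the class name `RotationAwareTail` and its methods are hypothetical, and the sketch assumes a Linux-style rename-and-recreate rotation.

```python
import os


def _path_identity(path):
    """Return (device, inode) for the file at path, or None if it is missing."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return (st.st_dev, st.st_ino)


class RotationAwareTail:
    """Hypothetical tailer that re-opens its path after a rename-style rotation."""

    def __init__(self, path):
        self.path = path
        self._open()

    def _open(self):
        self.handle = open(self.path, "r")
        # Take the identity from the open handle so it matches what we read.
        st = os.fstat(self.handle.fileno())
        self.identity = (st.st_dev, st.st_ino)

    def read_new_lines(self):
        # Drain whatever was appended to the file we already have open.
        lines = self.handle.readlines()
        current = _path_identity(self.path)
        # If the path now names a different file, the log was rotated.
        # Skipping this re-open is the kind of bug that loses logs.
        if current is not None and current != self.identity:
            self.handle.close()
            self._open()
            lines.extend(self.handle.readlines())
        return lines

    def close(self):
        self.handle.close()
```

Calling `read_new_lines()` periodically returns lines from both the rotated-away file and its replacement. A tailer that skips the identity check keeps reading the old file and never sees anything written to the new one.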

Noticing that this was happening, and then debugging it, has already absorbed a lot of effort. Logs that are missing entirely from ElasticSearch also make it much harder to debug other, unrelated issues, so I think this is fairly high priority to fix.

It seems to me that the very first thing we should do is upgrade to at least 1.12.1, and potentially all the way to the latest (1.14.1).

A currently affected example (subject to node churn etc.), and a means of identifying other cases, are noted in the comments.

Related Incident(s)

Originating issue(s): production#5657 (closed)

Desired Outcome/Acceptance criteria

No logs are lost under normal circumstances.

Associated Services

Corrective Action Issue Checklist

  • link the incident(s) this corrective action arose out of
  • give context for what problem this corrective action is trying to prevent from re-occurring
  • assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • assign a priority (this will default to 'priority::4')
Edited by Pierre Guinoiseau