Consolidate log shippers in Kubernetes
## Problem

We currently run two log shipper daemonsets in Kubernetes:

- fluent-bit in the `kube-system` namespace. This is part of GCP's default GKE configuration, and sends all container logs to Stackdriver.
- fluentd in the `logging` namespace.
  - At the time of writing, this is configured by us in https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/tree/master/releases/fluentd-elasticsearch. It sends logs to Elasticsearch.
  - After https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11655, which blocks this issue, it will likely be configured elsewhere, and will send logs to Pub/Sub, which acts as a buffer.
  - Only some containers (the ones running our apps) have their logs read and shipped.

This has a few suboptimal consequences:

- Since the GKE fluent-bit configuration reads all container log files, we're reading the highest-throughput log files (our own apps') twice. As we migrate more workloads to Kubernetes, this will become more painful.
- Our own apps' logs are stored in Stackdriver (as well as Elasticsearch), which incurs extra cost. I have not quantified this cost yet.
- Confusion for SREs: some log shipping config lives in our own Kubernetes config, while the rest is part of the GKE platform and not version controlled by us.

## Desired outcome

All log shipping/pipeline configuration is version controlled in one place, presumably the fluentd configuration that we currently store in https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/tree/master/releases/fluentd-elasticsearch (unless it has been moved to Tanka by then).

## Acceptance criteria

- [ ] All log shipping/pipeline configuration is version controlled in one place.
- [ ] Each log file is read only once in Kubernetes, and any required fan-out / routing logic is part of the log pipeline (see the sketch below).
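## Illustrative pipeline sketch

For discussion only, here is a minimal sketch of what a single consolidated fluentd pipeline could look like: each container log file is tailed exactly once, and the `copy` output plugin fans the stream out to multiple sinks. The plugin types (`tail`, `copy`, `elasticsearch` from fluent-plugin-elasticsearch, `google_cloud` from fluent-plugin-google-cloud) are real, but the hostname and paths are placeholders, and keeping a Stackdriver sink at all is an assumption, not the proposed design:

```
# Tail every container log file exactly once.
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
  </parse>
</source>

# Fan a single stream out to multiple sinks via the copy plugin.
<match kubernetes.**>
  @type copy
  <store>
    # fluent-plugin-elasticsearch; host is a placeholder.
    @type elasticsearch
    host elasticsearch.example.internal
    port 9200
    logstash_format true
  </store>
  <store>
    # fluent-plugin-google-cloud (Stackdriver); whether we keep this
    # sink at all is an open question given the cost concern above.
    @type google_cloud
  </store>
</match>
```

If we go this route, disabling the GKE-managed agent (e.g. via `gcloud container clusters update --logging-service=none`) would be part of the change, and per the note above, the Elasticsearch sink may be replaced by a Pub/Sub sink after https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11655.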