2020-03-10 ILM errors in production logging cluster
Summary
More information will be added as we investigate the issue.
Timeline
All times UTC.
2020-02-xx
- xx:xx (some time in February) registry logs started being sent directly to concrete indices rather than to an alias, which resulted in ILM errors and in indices not being deleted
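For context on the failure mode: ILM manages index lifecycles by having shippers write to a write alias; ILM then rolls the alias over to fresh indices and deletes indices once they age out. When a shipper writes to concrete index names instead, ILM has no managed indices to roll over or delete. A minimal sketch of the intended setup, with hypothetical names (`registry-logs`, `registry-logs-policy`) and illustrative ages, not the actual production policy:

```shell
# Hypothetical ILM policy: roll over daily or at 50 GB, delete 7 days after rollover.
curl -XPUT 'localhost:9200/_ilm/policy/registry-logs-policy' \
  -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "1d", "max_size": "50gb" } } },
      "delete": { "min_age": "7d", "actions": { "delete": {} } }
    }
  }
}'

# Bootstrap the first index with a write alias. Shippers must target the
# alias ("registry-logs"), never a concrete index name, otherwise ILM
# rollover and deletion stop working for the indices they create.
curl -XPUT 'localhost:9200/registry-logs-000001' \
  -H 'Content-Type: application/json' -d'
{
  "aliases": { "registry-logs": { "is_write_index": true } }
}'
```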
2020-03-10
- 10:00 - EOC takes a look at the alerts channel and notices that alerts have been firing
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
- reconfigure fluentd to use existing indices:
  - helm chart update: gitlab-org/charts/fluentd-elasticsearch!3 (merged)
  - helm release update: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/logging/-/merge_requests/7
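The actual change is in the linked MRs; for illustration only, the relevant knob in fluentd's elasticsearch output is roughly this (match pattern, host, and alias name are hypothetical):

```conf
<match registry.**>
  @type elasticsearch
  host elastic.example.internal
  port 9200
  # Write to the ILM write alias, not to date-stamped indices.
  # With logstash_format true the plugin creates a new concrete index
  # per day (e.g. registry-2020.03.10), which ILM does not manage.
  logstash_format false
  index_name registry-logs
</match>
```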
- check if there are any other alerts firing for logging infra (moved remaining work here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9522)
- check if there are any other indices that are not being removed (moved here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9523)
- perform a sanity check for the rest of the logging infra:
  - disk space on the production cluster is fine on hot and warm nodes
  - nonprod cluster is running out of disk space / has too many active shards
  - nonprod cluster is unhealthy, but we're not alerting on this
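The sanity checks above can be sketched as a handful of queries against each cluster; this is illustrative (host is a placeholder, and it assumes an Elasticsearch version with the ILM explain API, 6.7+):

```shell
# Placeholder endpoint; run against each cluster (prod and nonprod).
ES=localhost:9200

# Overall health: red/yellow status, active shard counts.
curl -s "$ES/_cluster/health?pretty"

# Disk usage per node, to spot nodes approaching the watermarks.
curl -s "$ES/_cat/allocation?v&h=node,disk.percent,disk.used,disk.avail"

# Indices sorted by creation date; old concrete indices that ILM
# should have deleted will show up at the top.
curl -s "$ES/_cat/indices?v&h=index,creation.date.string,store.size&s=creation.date"

# Per-index ILM status and any lifecycle errors.
curl -s "$ES/*/_ilm/explain?pretty"
```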
- consider bumping priority for paging on ES related alerts (add a note to discuss during the DNA meeting), particularly in light of: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1762
- why were there no alerts for the nonprod cluster?