2020-03-10 ILM errors in production logging cluster
Summary
More information will be added as we investigate the issue.
Timeline
All times UTC.
2020-02-xx
- xx:xx (some time in February) registry logs started being sent directly to concrete indices rather than to an alias, which resulted in ILM errors and in indices not being deleted
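For context on the failure mode: ILM manages index lifecycles by having shippers write to a write alias; ILM then rolls the alias over to fresh indices and deletes indices once they age out. When a shipper writes to concrete index names instead, ILM has no managed indices to roll over or delete. A minimal sketch of the intended setup, with hypothetical names (`registry-logs`, `registry-logs-policy`) and illustrative ages, not the actual production policy:

```shell
# Hypothetical ILM policy: roll over daily or at 50 GB, delete 7 days after rollover.
curl -XPUT 'localhost:9200/_ilm/policy/registry-logs-policy' \
  -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "1d", "max_size": "50gb" } } },
      "delete": { "min_age": "7d", "actions": { "delete": {} } }
    }
  }
}'

# Bootstrap the first index with a write alias. Shippers must target the
# alias ("registry-logs"), never a concrete index name, otherwise ILM
# rollover and deletion stop working for the indices they create.
curl -XPUT 'localhost:9200/registry-logs-000001' \
  -H 'Content-Type: application/json' -d'
{
  "aliases": { "registry-logs": { "is_write_index": true } }
}'
```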
2020-03-10
- 10:00 - EOC takes a look at the alerts channel and notices that alerts have been firing
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
- reconfigure fluentd to use existing indices:
  - helm chart update: gitlab-org/charts/fluentd-elasticsearch!3 (merged)
  - helm release update: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/logging/-/merge_requests/7
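The actual change is in the linked MRs; for illustration only, the relevant knob in fluentd's elasticsearch output is roughly this (match pattern, host, and alias name are hypothetical):

```conf
<match registry.**>
  @type elasticsearch
  host elastic.example.internal
  port 9200
  # Write to the ILM write alias, not to date-stamped indices.
  # With logstash_format true the plugin creates a new concrete index
  # per day (e.g. registry-2020.03.10), which ILM does not manage.
  logstash_format false
  index_name registry-logs
</match>
```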
- check if there are any other alerts firing for logging infra (moved remaining work here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9522)
- check if there are any other indices that are not being removed (moved here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9523)
- perform a sanity check for the rest of the logging infra:
  - disk space on the production cluster is fine on hot and warm nodes
  - nonprod cluster is running out of disk space / has too many active shards
  - nonprod cluster is unhealthy, but we're not alerting on this
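The sanity checks above can be sketched as a handful of queries against each cluster; this is illustrative (host is a placeholder, and it assumes an Elasticsearch version with the ILM explain API, 6.7+):

```shell
# Placeholder endpoint; run against each cluster (prod and nonprod).
ES=localhost:9200

# Overall health: red/yellow status, active shard counts.
curl -s "$ES/_cluster/health?pretty"

# Disk usage per node, to spot nodes approaching the watermarks.
curl -s "$ES/_cat/allocation?v&h=node,disk.percent,disk.used,disk.avail"

# Indices sorted by creation date; old concrete indices that ILM
# should have deleted will show up at the top.
curl -s "$ES/_cat/indices?v&h=index,creation.date.string,store.size&s=creation.date"

# Per-index ILM status and any lifecycle errors.
curl -s "$ES/*/_ilm/explain?pretty"
```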
- consider bumping priority for paging on ES related alerts (add a note to discuss during the DNA meeting), particularly in light of: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1762
- why were there no alerts for the nonprod cluster?