Kibana is inaccessible
Summary
Kibana is inaccessible due to high memory pressure on the master node.
Timeline
All times UTC.
2020-05-09
- 17:54 - EOC was alerted by support that Kibana was down
- 17:59 - Incident declared from Slack
- 18:08 - IMOC paged
- 18:10 - The Cloud dashboard is showing the same memory pressure on the master, along with a warning that no master is elected.
- 18:13 - @AnthonySandoval kicks off a manual deployment to apply a previously failed "grow and shrink" change, an approach Elastic has acknowledged is suboptimal and may not work properly on clusters with a single large shard.
- 18:25 - @AnthonySandoval updates existing support case with Elastic engineers informing them of the issue.
- 18:31 - @AnthonySandoval opens a new Elastic support case at Severity 1.
Note: again, we may not receive a response because we're outside of our Gold Plan business support hours.
- 18:53 - We're continuing to watch the deployment.
- 19:12 - @AnthonySandoval has escalated to our Elastic representative.
- 19:15 - Manual deployment failed. The master shows as unhealthy and the UI reports 195 of 1290 shards failed with reason `circuit_breaking_exception`.
- 19:18 - @AnthonySandoval puts the cluster into maintenance mode to double the memory of the master nodes.
- 19:18 - 8GB memory upgrade for master nodes fails multiple times.
- 19:40 - @AnthonySandoval stops routing traffic to the memory-saturated master node.
- 19:44 - Memory usage on us-central-1b master increases from 30 to 70 percent utilization.
- 19:51 - We're watching the number of unavailable Elasticsearch shards decline steadily at ~100 shards per minute (see the monitoring sketch after this timeline).
- 19:55 - Kibana is responsive, but log ingestion is behind (see #2098 (comment 339292359)).
- 20:01 - The previous master node is still showing as "Paused" in the dashboard. The banner claims "One Elasticsearch instance is not running."
- 20:08 - Elasticsearch cluster is "Healthy, with warnings".
- 20:12 - @AnthonySandoval attempts again to resize the masters to 8GB memory.
- 20:33 - The dashboard shows that shards are being moved.
- 20:40 - The upgrade is being rolled back after recording failures.
- 20:42 - Elastic support has updated the ticket and is observing the shard migration. They've asked us not to make any further plan changes.
- 20:48 - Elastic support indicates that although the plan failed, the new master nodes were created. They will attempt to manually failover to the new larger masters.
- 21:20 - Elastic support has confirmed that we're nearly migrated to the new masters.
- 21:35 - The cluster is reporting healthy with 3 resized masters. Indexing is catching up.
- 23:16 - The Rails log messages on PubSub are down from a max of 141.3MM messages to 99.6MM.
2020-05-10
- 01:16 - The Rails log messages on PubSub are down to 14MM messages. Awaiting the PubSub message queue to process the remaining messages into the indexes (see the backlog sketch after this timeline). See https://dashboards.gitlab.com/d/USVj3qHmk/logging?fullscreen&panelId=2&from=now-8h&to=now.
- 01:38 - All indices have been backfilled from the PubSub queues.
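For reference, a minimal sketch of the kind of checks performed during recovery: it polls cluster health, the unassigned shard count, and parent circuit-breaker pressure through the standard Elasticsearch APIs. The endpoint and credentials are placeholders, not the production logging cluster, and this is an illustration rather than the exact procedure used in the incident.

```python
# Hypothetical recovery-monitoring sketch. ES_URL and AUTH are placeholders.
import time
import requests

ES_URL = "https://logging-cluster.example.com:9243"  # placeholder endpoint
AUTH = ("elastic", "CHANGE_ME")                      # placeholder credentials


def snapshot():
    health = requests.get(f"{ES_URL}/_cluster/health", auth=AUTH, timeout=10).json()
    stats = requests.get(f"{ES_URL}/_nodes/stats/breaker", auth=AUTH, timeout=10).json()

    # Worst parent circuit-breaker utilisation across nodes; tripping this
    # breaker is what surfaces as circuit_breaking_exception on shard requests.
    worst = 0.0
    for node in stats["nodes"].values():
        parent = node["breakers"]["parent"]
        if parent["limit_size_in_bytes"]:
            worst = max(worst, parent["estimated_size_in_bytes"] / parent["limit_size_in_bytes"])

    return health["status"], health["unassigned_shards"], worst


if __name__ == "__main__":
    while True:
        status, unassigned, breaker = snapshot()
        print(f"status={status} unassigned_shards={unassigned} parent_breaker={breaker:.0%}")
        if status == "green" and unassigned == 0:
            break
        time.sleep(60)  # roughly matches the per-minute observations in the timeline
```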
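Similarly, a hedged sketch of how the PubSub backlog behind the linked logging dashboard could be read programmatically, using the Cloud Monitoring `num_undelivered_messages` metric. The project and subscription names are placeholders; the figures quoted in the timeline were read from the dashboard itself.

```python
# Hypothetical backlog check for the Pub/Sub subscription feeding the indexer.
# PROJECT and SUBSCRIPTION_ID are placeholders, not the production names.
import time
from google.cloud import monitoring_v3

PROJECT = "projects/example-logging-project"  # placeholder project
SUBSCRIPTION_ID = "rails-log-consumer"        # placeholder subscription

client = monitoring_v3.MetricServiceClient()
now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now)}, "start_time": {"seconds": int(now - 300)}}
)

results = client.list_time_series(
    request={
        "name": PROJECT,
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages" '
            f'AND resource.labels.subscription_id = "{SUBSCRIPTION_ID}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    latest = series.points[0]  # points are returned newest-first
    print(f"{SUBSCRIPTION_ID}: {latest.value.int64_value} undelivered messages")
```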
Details
Kibana is currently inaccessible due to an unhealthy Elasticsearch cluster.
Source
Incident declared by alex in Slack via the `/incident declare` command.
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)