Kibana is inaccessible
Summary
Kibana is inaccessible due to high memory pressure on the master node.
Timeline
All times UTC.
2020-05-09
- 17:54 - EOC was alerted by support that Kibana was down
- 17:59 - Incident declared from Slack
- 18:08 - IMOC paged
- 18:10 - The Cloud dashboard is showing the same memory pressure on the master, along with a warning that no master is elected.
- 18:13 - @AnthonySandoval kicks off a manual deployment to apply a previously failed "grow and shrink" change, an approach Elastic has acknowledged is suboptimal and may not work properly on clusters with a single large shard.
- 18:25 - @AnthonySandoval updates existing support case with Elastic engineers informing them of the issue.
- 18:31 - @AnthonySandoval opens a new Elastic support case at Severity 1.
Note: again, we may not receive a response because we're outside of our Gold Plan business support hours.
- 18:53 - We're continuing to watch the deployment.
- 19:12 - @AnthonySandoval has escalated to our Elastic representative.
- 19:15 - Manual deployment failed. The master shows as unhealthy and the UI reports 195 of 1290 shards failed with reason `circuit_breaking_exception`.
- 19:18 - @AnthonySandoval puts the cluster into maintenance mode to double the memory of the master nodes.
- 19:18 - 8GB memory upgrade for master nodes fails multiple times.
- 19:40 - @AnthonySandoval stops routing traffic to the memory-saturated master node.
- 19:44 - Memory usage on us-central-1b master increases from 30 to 70 percent utilization.
- 19:51 - We're watching the number of unavailable Elasticsearch shards decline steadily at ~100 shards per minute (see the monitoring sketch after this timeline).
- 19:55 - Kibana is responsive, but log ingestion is behind (see #2098 (comment 339292359)).
- 20:01 - The previous master node is still showing as "Paused" in the dashboard. The banner claims "One Elasticsearch instance is not running."
- 20:08 - Elasticsearch cluster is "Healthy, with warnings".
- 20:12 - @AnthonySandoval attempts again to resize the masters to 8GB memory.
- 20:33 - The dashboard shows that shards are being moved.
- 20:40 - The upgrade is being rolled back after recording failures.
- 20:42 - Elastic support has updated the ticket and is observing the shard migration. They've asked us not to make any further plan changes.
- 20:48 - Elastic support indicates that although the plan failed, the new master nodes were created. They will attempt to manually failover to the new larger masters.
- 21:20 - Elastic support has confirmed that we're nearly migrated to the new masters.
- 21:35 - The cluster is reporting healthy with 3 resized masters. Indexing is catching up.
- 23:16 - The Rails log messages on PubSub are down from a max of 141.3MM messages to 99.6MM.
2020-05-10
- 01:16 - The Rails log messages on PubSub are down to 14MM messages. Awaiting the PubSub message queue to process the remaining messages into the indexes (see the backlog sketch after this timeline). See https://dashboards.gitlab.com/d/USVj3qHmk/logging?fullscreen&panelId=2&from=now-8h&to=now.
- 01:38 - All indices have been backfilled from the PubSub queues.
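For reference, a minimal sketch of the kind of checks performed during recovery: it polls cluster health, the unassigned shard count, and parent circuit-breaker pressure through the standard Elasticsearch APIs. The endpoint and credentials are placeholders, not the production logging cluster, and this is an illustration rather than the exact procedure used in the incident.

```python
# Hypothetical recovery-monitoring sketch. ES_URL and AUTH are placeholders.
import time
import requests

ES_URL = "https://logging-cluster.example.com:9243"  # placeholder endpoint
AUTH = ("elastic", "CHANGE_ME")                      # placeholder credentials


def snapshot():
    health = requests.get(f"{ES_URL}/_cluster/health", auth=AUTH, timeout=10).json()
    stats = requests.get(f"{ES_URL}/_nodes/stats/breaker", auth=AUTH, timeout=10).json()

    # Worst parent circuit-breaker utilisation across nodes; tripping this
    # breaker is what surfaces as circuit_breaking_exception on shard requests.
    worst = 0.0
    for node in stats["nodes"].values():
        parent = node["breakers"]["parent"]
        if parent["limit_size_in_bytes"]:
            worst = max(worst, parent["estimated_size_in_bytes"] / parent["limit_size_in_bytes"])

    return health["status"], health["unassigned_shards"], worst


if __name__ == "__main__":
    while True:
        status, unassigned, breaker = snapshot()
        print(f"status={status} unassigned_shards={unassigned} parent_breaker={breaker:.0%}")
        if status == "green" and unassigned == 0:
            break
        time.sleep(60)  # roughly matches the per-minute observations in the timeline
```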
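Similarly, a hedged sketch of how the PubSub backlog behind the linked logging dashboard could be read programmatically, using the Cloud Monitoring `num_undelivered_messages` metric. The project and subscription names are placeholders; the figures quoted in the timeline were read from the dashboard itself.

```python
# Hypothetical backlog check for the Pub/Sub subscription feeding the indexer.
# PROJECT and SUBSCRIPTION_ID are placeholders, not the production names.
import time
from google.cloud import monitoring_v3

PROJECT = "projects/example-logging-project"  # placeholder project
SUBSCRIPTION_ID = "rails-log-consumer"        # placeholder subscription

client = monitoring_v3.MetricServiceClient()
now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now)}, "start_time": {"seconds": int(now - 300)}}
)

results = client.list_time_series(
    request={
        "name": PROJECT,
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages" '
            f'AND resource.labels.subscription_id = "{SUBSCRIPTION_ID}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    latest = series.points[0]  # points are returned newest-first
    print(f"{SUBSCRIPTION_ID}: {latest.value.int64_value} undelivered messages")
```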
Details
Kibana is currently inaccessible due to an unhealthy Elasticsearch cluster.
Source
Incident declared by alex in Slack via the `/incident declare` command.
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)