Reduced indexing capacity of the production logging cluster
Summary
Reduced indexing capacity of the production logging cluster: the master node's CPU saturated, cluster state updates slowed dramatically, and an indexing backlog built up until the nodes rejoined and recovery began at 11:48 UTC.
Timeline
All times UTC.
2020-05-15
- 09:30 sudden increase in CPU utilization on the master node, from a steady 25% to 90%
- 09:40 tasks begin to queue but are still being processed (the master is on the verge of saturation)
- 09:50 the task queue begins to grow steadily (we likely reached the saturation point); logs start to queue
- 10:20 metrics stop being sent to the monitoring cluster and ILM (index lifecycle management) is unable to run
- 10:48 support case is opened: https://support.elastic.co/customers/s/case/5004M00000cqs5aQAA
- 10:55 incident declared via Slack
- 11:00 instance-41 disconnects from the master
- 11:35 the situation keeps deteriorating; computing the cluster state can take up to 1.5 minutes, as in this log entry:
May 14, 2020, 11:38:21 AM UTC | WARN | instance-0000000067 @ us-central1-c
[instance-0000000067] took [1.5m], which is over [10s], to compute cluster state update for [node-left[{instance-0000000041}{VM2E5_y5S_-QhlGTk4ST5Q}{nlSzcUi5Qy-jcollw2tSiQ}{10.42.0.213}{10.42.0.213:19845}{di}{logical_availability_zone=zone-1, server_name=instance-0000000041.92c87c26b16049b0a30af16b94105528, availability_zone=us-central1-b, xpack.installed=true, data=hot, instance_configuration=gcp.data.highio.1, region=unknown-region} reason: disconnected]]
the task queue is growing rapidly (see the diagnostic sketch after the timeline)
- 11:45 all nodes report the master as unavailable
- 11:48 all nodes rejoin the cluster and we start to recover, with ~6k shards unavailable
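A minimal sketch of how the queue growth and the shard recovery could be tracked during an incident like this one, assuming Python with the requests library; the cluster endpoint and credentials below are placeholders, while _cluster/pending_tasks and _cluster/health are standard Elasticsearch APIs.

import requests

ES = "https://logging-cluster.example.com:9243"  # hypothetical endpoint
AUTH = ("elastic", "<password>")  # placeholder credentials

# Depth of the master's pending cluster state update queue; steady
# growth here matches what we observed from 09:50 onwards.
pending = requests.get(f"{ES}/_cluster/pending_tasks", auth=AUTH).json()
print("pending cluster tasks:", len(pending["tasks"]))

# Cluster health, including the unassigned shard count (~6k at 11:48
# when the nodes rejoined and recovery started).
health = requests.get(f"{ES}/_cluster/health", auth=AUTH).json()
print("status:", health["status"],
      "| unassigned shards:", health["unassigned_shards"])

Polling the pending task count over time distinguishes a master that is merely busy (the queue drains) from one past saturation (the queue only grows).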
Details
A number of indices are experiencing reduced indexing capacity and, as a result, a backlog of unindexed logs is building up; the sketch below shows one way to spot the affected nodes.
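A rough way to quantify reduced indexing capacity per node is to poll the write thread pool statistics; this sketch makes the same assumptions as the one above (placeholder endpoint and credentials), and _cat/thread_pool is a standard Elasticsearch API.

import requests

ES = "https://logging-cluster.example.com:9243"  # hypothetical endpoint
AUTH = ("elastic", "<password>")  # placeholder credentials

# Per-node write thread pool stats: a persistently full queue and a
# growing "rejected" counter identify nodes falling behind on indexing.
rows = requests.get(
    f"{ES}/_cat/thread_pool/write",
    params={"format": "json", "h": "node_name,active,queue,rejected"},
    auth=AUTH,
).json()
for row in rows:
    print(f"{row['node_name']}: active={row['active']} "
          f"queue={row['queue']} rejected={row['rejected']}")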
Source
Incident declared by mwasilewski in Slack via the /incident declare command.
Resources
- If the Situation Zoom room was used, the recording will be uploaded automatically to the Incident room Google Drive folder (private)
Edited by Michal Wasilewski