Reduced indexing capacity of the production logging cluster
Summary
Reduced indexing capacity of the production logging cluster: the master node's CPU saturated, cluster state updates slowed dramatically, and an indexing backlog built up until the nodes rejoined and recovery began at 11:48 UTC.
Timeline
All times UTC.
2020-05-15
- 09:30 sudden increase in CPU utilization on the master node, from a steady 25% to 90%
- 09:40 tasks begin to queue but are still being processed (the master is on the verge of saturation)
- 09:50 the task queue begins to grow steadily (we likely reached the saturation point); logs start to queue
- 10:20 metrics stop being sent to the monitoring cluster and ILM (index lifecycle management) is unable to run
- 10:48 support case is opened: https://support.elastic.co/customers/s/case/5004M00000cqs5aQAA
- 10:55 incident declared via Slack
- 11:00 instance-41 disconnects from the master
- 11:35 the situation keeps deteriorating; computing the cluster state can take up to 1.5 minutes, as in this log entry:
May 14, 2020, 11:38:21 AM UTC | WARN | instance-0000000067 @ us-central1-c
[instance-0000000067] took [1.5m], which is over [10s], to compute cluster state update for [node-left[{instance-0000000041}{VM2E5_y5S_-QhlGTk4ST5Q}{nlSzcUi5Qy-jcollw2tSiQ}{10.42.0.213}{10.42.0.213:19845}{di}{logical_availability_zone=zone-1, server_name=instance-0000000041.92c87c26b16049b0a30af16b94105528, availability_zone=us-central1-b, xpack.installed=true, data=hot, instance_configuration=gcp.data.highio.1, region=unknown-region} reason: disconnected]]
the task queue is growing rapidly (see the diagnostic sketch after the timeline)
- 11:45 all nodes report the master as unavailable
- 11:48 all nodes rejoin the cluster and we start to recover, with ~6k shards unavailable
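A minimal sketch of how the queue growth and the shard recovery could be tracked during an incident like this one, assuming Python with the requests library; the cluster endpoint and credentials below are placeholders, while _cluster/pending_tasks and _cluster/health are standard Elasticsearch APIs.

import requests

ES = "https://logging-cluster.example.com:9243"  # hypothetical endpoint
AUTH = ("elastic", "<password>")  # placeholder credentials

# Depth of the master's pending cluster state update queue; steady
# growth here matches what we observed from 09:50 onwards.
pending = requests.get(f"{ES}/_cluster/pending_tasks", auth=AUTH).json()
print("pending cluster tasks:", len(pending["tasks"]))

# Cluster health, including the unassigned shard count (~6k at 11:48
# when the nodes rejoined and recovery started).
health = requests.get(f"{ES}/_cluster/health", auth=AUTH).json()
print("status:", health["status"],
      "| unassigned shards:", health["unassigned_shards"])

Polling the pending task count over time distinguishes a master that is merely busy (the queue drains) from one past saturation (the queue only grows).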
Details
A number of indices are experiencing reduced indexing capacity and, as a result, a backlog of unindexed logs is building up; the sketch below shows one way to spot the affected nodes.
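A rough way to quantify reduced indexing capacity per node is to poll the write thread pool statistics; this sketch makes the same assumptions as the one above (placeholder endpoint and credentials), and _cat/thread_pool is a standard Elasticsearch API.

import requests

ES = "https://logging-cluster.example.com:9243"  # hypothetical endpoint
AUTH = ("elastic", "<password>")  # placeholder credentials

# Per-node write thread pool stats: a persistently full queue and a
# growing "rejected" counter identify nodes falling behind on indexing.
rows = requests.get(
    f"{ES}/_cat/thread_pool/write",
    params={"format": "json", "h": "node_name,active,queue,rejected"},
    auth=AUTH,
).json()
for row in rows:
    print(f"{row['node_name']}: active={row['active']} "
          f"queue={row['queue']} rejected={row['rejected']}")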
Source
Incident declared by mwasilewski in Slack via the /incident declare command.
Resources
- If the Situation Zoom room was used, the recording will be uploaded automatically to the Incident room Google Drive folder (private)
Edited by Michal Wasilewski