Resize production logging cluster

Production Change - Criticality 3 (C3)

Change Component	Description
Change Objective	Reduce the size of the production logging cluster
Change Type	Cluster resizing
Services Impacted	ES production logging cluster
Change Team Members	@mwasilewski-gitlab @igorwwwwwwwwwwwwwwwwwwww
Change Criticality	C3
Change Reviewer or tested in staging	-
Dry-run output	-
Due Date	2020-07-08 12:00:00 UTC
Time tracking	To estimate and record times associated with changes (including a possible rollback)

Log in to our ES Cloud web interface: https://cloud.elastic.co/deployments (credentials are in 1Password)
Go to the gitlab-logs-prod deployment
Click Edit on the left-hand side
Scale the cluster back to the desired size
Click Save

Metric: Backlog Bytes
- Location: https://dashboards.gitlab.net/d/USVj3qHmk/logging?orgId=1&from=now-7d&to=now&refresh=30s
- What changes to this metric should prompt a rollback: if the metric exceeds ~150M for more than a couple of minutes, that is a sign something is wrong. If it grows continuously, that is a clear signal for a rollback.
Metric: Oldest unacked message
- Location: https://dashboards.gitlab.net/d/USVj3qHmk/logging?orgId=1&from=now-7d&to=now&refresh=30s
- What changes to this metric should prompt a rollback: if the metric exceeds 5 minutes for more than a couple of minutes, that is a signal for a rollback.
Metric: elastic_thread_pools component saturation: Thread pool utilization
- Location: https://dashboards.gitlab.net/d/logging-main/logging-overview?orgId=1&from=now-3h&to=now
- What changes to this metric should prompt a rollback: exceeding 50% should be considered unhealthy. Exceeding 75% for more than 15 minutes is a clear signal for a rollback.
Metric: elastic_single_node_cpu component saturation: Average CPU Saturation per Node
- Location: https://dashboards.gitlab.net/d/logging-main/logging-overview?orgId=1&from=now-3h&to=now
- What changes to this metric should prompt a rollback: exceeding 50% should be considered unhealthy. Exceeding 75% for more than 15 minutes is a clear signal for a rollback.
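
The rollback criteria above can be sketched as a small check, which may help when watching the dashboards during the change. This is only an illustration, not tooling that exists for this change. The function names, the one-sample-per-minute input format, reading "a couple of minutes" as 2 minutes, reading "~150M" as 150e6 bytes, and the 5-sample growth window are all assumptions made for the example.

```python
from typing import List, Sequence

# Assumed interpretations of the thresholds listed above (hypothetical values).
COUPLE_OF_MINUTES = 2          # "a couple of minutes"
BACKLOG_LIMIT_BYTES = 150e6    # "~150M", assumed to be bytes
UNACKED_LIMIT_SECONDS = 5 * 60 # "5 minutes"
SATURATION_LIMIT_PCT = 75      # thread pool and CPU saturation
SATURATION_MINUTES = 15
GROWTH_WINDOW = 5              # samples used to detect continuous growth


def sustained_above(values: Sequence[float], threshold: float, minutes: int) -> bool:
    """True if more than `minutes` consecutive one-minute samples exceed `threshold`."""
    run = 0
    for v in values:
        run = run + 1 if v > threshold else 0
        if run > minutes:
            return True
    return False


def grows_continuously(values: Sequence[float], window: int) -> bool:
    """True if the last `window` samples are strictly increasing."""
    tail = list(values)[-window:]
    return len(tail) == window and all(a < b for a, b in zip(tail, tail[1:]))


def rollback_reasons(
    backlog_bytes: Sequence[float],
    oldest_unacked_s: Sequence[float],
    thread_pool_pct: Sequence[float],
    cpu_pct: Sequence[float],
) -> List[str]:
    """Return the rollback criteria that the given samples trigger."""
    reasons = []
    if sustained_above(backlog_bytes, BACKLOG_LIMIT_BYTES, COUPLE_OF_MINUTES):
        reasons.append("backlog above ~150M")
    if grows_continuously(backlog_bytes, GROWTH_WINDOW):
        reasons.append("backlog growing continuously")
    if sustained_above(oldest_unacked_s, UNACKED_LIMIT_SECONDS, COUPLE_OF_MINUTES):
        reasons.append("oldest unacked message above 5 minutes")
    if sustained_above(thread_pool_pct, SATURATION_LIMIT_PCT, SATURATION_MINUTES):
        reasons.append("thread pool utilization above 75%")
    if sustained_above(cpu_pct, SATURATION_LIMIT_PCT, SATURATION_MINUTES):
        reasons.append("CPU saturation above 75%")
    return reasons
```

An empty list means none of the criteria are met. The 50% "unhealthy" level is deliberately left out of the check, since the text treats it as a warning rather than a rollback trigger.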

Does this change introduce new compute instances? no
Does this change re-size any existing compute instances? no (it does not resize them; it removes some machines)
Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.? no

Edited Jul 14, 2020 by Michal Wasilewski