Resize production logging cluster
C3
Production Change - Criticality 3

Change Component | Description |
---|---|
Change Objective | Reduce the size of the production logging cluster |
Change Type | Cluster resizing |
Services Impacted | ES production logging cluster |
Change Team Members | @mwasilewski-gitlab @igorwwwwwwwwwwwwwwwwwwww |
Change Criticality | C3 |
Change Reviewer or tested in staging | - |
Dry-run output | - |
Due Date | 2020-07-08 12:00:00 UTC |
Time tracking | To estimate and record times associated with changes (including a possible rollback) |
Detailed steps for the change
- current state: 14 hot nodes, 8 warm nodes
- scale down hot fleet from 14 nodes to 12 nodes (6 nodes x 2 zones)
- leave it for 2 business days
- scale down hot fleet from 12 nodes to 10 nodes (5 nodes x 2 zones)
- leave it for 2 business days
- scale down warm fleet from 8 to 6 (3 nodes x 2 zones)
- leave it for 2 business days
- At that point compare metrics with what we're seeing today
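
The staged plan above can be written down as data to sanity-check the arithmetic. This is a minimal sketch, assuming the two zones stated in the steps; the names `Step`, `PLAN` and `final_state` are hypothetical, and nothing here talks to Elastic Cloud.

```python
from dataclasses import dataclass

ZONES = 2               # every target size above is "N nodes x 2 zones"
SOAK_BUSINESS_DAYS = 2  # wait time after each scale-down step


@dataclass(frozen=True)
class Step:
    tier: str    # "hot" or "warm"
    before: int  # total nodes in the tier before the step
    after: int   # total nodes in the tier after the step

    def nodes_per_zone(self) -> int:
        """Nodes per zone after the step; fails if the split is uneven."""
        if self.after % ZONES != 0:
            raise ValueError(f"{self.after} nodes do not split evenly across {ZONES} zones")
        return self.after // ZONES


# The three scale-down steps, in the order listed above.
PLAN = [
    Step("hot", 14, 12),
    Step("hot", 12, 10),
    Step("warm", 8, 6),
]


def final_state(start: dict, plan: list) -> dict:
    """Apply each step in order, checking it starts from the expected size."""
    state = dict(start)
    for step in plan:
        if state[step.tier] != step.before:
            raise ValueError(f"{step.tier} tier has {state[step.tier]} nodes, expected {step.before}")
        step.nodes_per_zone()  # raises if the target is not zone-balanced
        state[step.tier] = step.after
    return state


if __name__ == "__main__":
    print(final_state({"hot": 14, "warm": 8}, PLAN))
    print("minimum soak time (business days):", len(PLAN) * SOAK_BUSINESS_DAYS)
```

Starting from 14 hot and 8 warm nodes, the plan ends at 10 hot and 6 warm nodes, with at least 6 business days of soak time across the three steps.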
Rollback steps
- log in to our ES Cloud web interface: https://cloud.elastic.co/deployments (credentials in 1pass)
- go to `gitlab-logs-prod`
- click `Edit` on the left hand side
- scale the cluster back to the desired size
- click `Save`
Monitoring
Key metrics to observe
- Metric: Backlog Bytes
  - Location: https://dashboards.gitlab.net/d/USVj3qHmk/logging?orgId=1&from=now-7d&to=now&refresh=30s
  - What changes to this metric should prompt a rollback: exceeding ~150M for more than a couple of minutes is a sign that something is wrong; continuous growth is a clear signal for a rollback
- Metric: Oldest unacked message
  - Location: https://dashboards.gitlab.net/d/USVj3qHmk/logging?orgId=1&from=now-7d&to=now&refresh=30s
  - What changes to this metric should prompt a rollback: exceeding 5 minutes for more than a couple of minutes is a signal for a rollback
- Metric: elastic_thread_pools component saturation: Thread pool utilization
  - Location: https://dashboards.gitlab.net/d/logging-main/logging-overview?orgId=1&from=now-3h&to=now
  - What changes to this metric should prompt a rollback: exceeding 50% should be considered unhealthy; exceeding 75% for more than 15 minutes is a clear signal for a rollback
- Metric: elastic_single_node_cpu component saturation: Average CPU Saturation per Node
  - Location: https://dashboards.gitlab.net/d/logging-main/logging-overview?orgId=1&from=now-3h&to=now
  - What changes to this metric should prompt a rollback: exceeding 50% should be considered unhealthy; exceeding 75% for more than 15 minutes is a clear signal for a rollback
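
The "above X for more than Y" rollback criteria can be sketched as a check over timestamped samples. This is an illustration only: the function names and the `RULES` keys are hypothetical, "a couple of minutes" is read as 120 seconds, and "~150M" is read as 150,000,000 bytes, both of which are assumptions. It does not query the dashboards.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)

# (threshold, minimum breach duration in seconds) per metric.
# The 120 s durations and the byte interpretation of "~150M" are assumptions.
RULES = {
    "backlog_bytes": (150_000_000, 120),
    "oldest_unacked_seconds": (5 * 60, 120),
    "thread_pool_utilization": (0.75, 15 * 60),
    "cpu_saturation": (0.75, 15 * 60),
}


def longest_breach_seconds(samples: Sequence[Sample], threshold: float) -> float:
    """Longest continuous stretch during which the value stayed above threshold."""
    longest = 0.0
    start = None  # timestamp of the first sample in the current breach
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start = ts
            longest = max(longest, ts - start)
        else:
            start = None  # breach interrupted, reset
    return longest


def should_roll_back(samples: Sequence[Sample], threshold: float, min_duration_s: float) -> bool:
    """True when the metric stayed above threshold for longer than min_duration_s."""
    return longest_breach_seconds(samples, threshold) > min_duration_s


if __name__ == "__main__":
    threshold, duration = RULES["cpu_saturation"]
    # One sample per minute at 80% CPU for 20 minutes.
    sustained = [(i * 60, 0.80) for i in range(21)]
    print(should_roll_back(sustained, threshold, duration))
```

The check only covers sustained threshold breaches. The "grows continuously" criterion for Backlog Bytes is a trend judgment and is not modeled here.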
Summary of infrastructure changes
- Does this change introduce new compute instances? no
- Does this change re-size any existing compute instances? no (it doesn't resize them, it gets rid of some machines)
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? no
Changes checklist
- Detailed steps and rollback steps have been filled prior to commencing work
- SRE on-call has been informed prior to change being rolled out
- There are currently no active incidents
Edited by Michal Wasilewski