Follow up actions on the ILM errors in production

- continue the conversation with support: https://support.elastic.co/customers/s/case/5004M00000cqitLQAQ
- thread pool saturation
  - reduce the number of mapping updates:
    - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10173
    - add static mappings (see the sketch below)
    - improve the process around making logging schema changes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10353
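
  A minimal sketch of what enforcing static mappings could look like. The template name, index pattern and fields below are assumptions for illustration only, and `$ES` stands for the cluster URL plus credentials. With `dynamic` set to `false`, log fields that are not explicitly mapped no longer trigger mapping updates on the master:

  ```sh
  # Illustrative template - the name, pattern and fields are assumptions,
  # not the real gprd config. $ES = cluster URL + credentials.
  curl -XPUT "$ES/_template/pubsub-rails-inf-gprd" \
    -H 'Content-Type: application/json' -d '{
    "index_patterns": ["pubsub-rails-inf-gprd-*"],
    "mappings": {
      "dynamic": false,
      "properties": {
        "json.severity":   { "type": "keyword" },
        "json.duration_s": { "type": "float" }
      }
    }
  }'
  ```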

- too many fields in indices:
  - reduce the number of fields (a quick way to count fields per index is sketched below):
    - Rails:
      - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9818
      - gitlab-org/gitlab!31910 (merged) - once the static mappings are in place, we can investigate which fields we don't need and how to enforce that (in LabKit etc.)
    - GKE:
      - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9876 - we stopped sending GKE logs to ES, so we should clean up any GKE-related config that is left over
      - re-enable GKE logs for selected fields
  - an example of an error message in the cluster that points to the relationship between cluster state size and timeouts:
    - `[instance-0000000067] took [10.1s], which is over [10s], to compute cluster state update for [cluster_reroute(reroute after starting shards)]`
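
  As a sketch, the field count per index can be pulled from the field capabilities API and compared against the per-index `index.mapping.total_fields.limit` setting. The index name is only an example and `$ES` stands for the cluster URL plus credentials:

  ```sh
  # Rough count of mapped fields in a single index (example index name).
  curl -s "$ES/pubsub-rails-inf-gprd-000001/_field_caps?fields=*" \
    | jq '.fields | length'

  # The per-index cap those fields count against (Elasticsearch default 1000);
  # raising it is a workaround, not a fix.
  curl -s "$ES/pubsub-rails-inf-gprd-000001/_settings?include_defaults=true&filter_path=*.*.index.mapping.total_fields"
  ```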

- too many shards in the cluster - https://www.elastic.co/guide/en/elasticsearch/reference/current/avoid-oversharding.html
  - we're hitting the `max_shards_per_cluster` limit - this was temporarily addressed with: gitlab-com/runbooks!2182 (merged)
  - bring the `max_shards_per_node` limit back down: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10093. Take into consideration the problem with a big number of shards in the cluster and &178 (closed) (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10008#note_334635720)
  - remove the two additional warm nodes added to alleviate memory pressure (not max_shards_per_node); let's wait with this, created a separate issue
  - scale down master nodes; let's revisit this at a later point: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10230
  - lower the frequency of index rollovers: gitlab-com/runbooks!2205 (merged)
  - fix the policy in the log production cluster: gitlab-com/runbooks!2212 (merged)
  - we removed again all `pubsub-rc-rails-*` indices and index templates; we still don't know what these are used for
  - lower the frequency of rollovers further: gitlab-com/runbooks!2215 (merged)
  - potential fixes (two of them are sketched below):
    - as part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10094#note_341928434 we're already rolling over: Tue 22k, Wed 20k, Thu 14k, Fri 8k
    - reindex/reshard
    - close indices (probably not, because we would lose HA)
    - use dedicated ILM policies and index templates for rails, gitaly and workhorse (indices with lower indexing rates don't need 6 shards)
    - lower the sending rate
    - lower the retention period
    - increase the size threshold for index rollover: https://gitlab.com/gitlab-com/runbooks/-/blob/master/elastic/managed-objects/log_gprd/ILM/gitlab-infra-ilm-policy.jsonnet#L8
    - increase the time threshold for rolling over indices: https://gitlab.com/gitlab-com/runbooks/-/blob/master/elastic/managed-objects/log_gprd/ILM/gitlab-infra-ilm-policy.jsonnet#L7
    - adjust the `max_shards_per_node` limit (this is undesirable and will result in "oversharding")
    - add more nodes
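
  For context on the two threshold items above: they live in the `rollover` action of the ILM policy. A sketch of the relevant fragment, with illustrative values only; the source of truth is the linked `gitlab-infra-ilm-policy.jsonnet`, and the policy name here is assumed to match that file:

  ```sh
  # Values are illustrative - the real thresholds are defined in the jsonnet
  # file in the runbooks repo. $ES = cluster URL + credentials.
  curl -XPUT "$ES/_ilm/policy/gitlab-infra-ilm-policy" \
    -H 'Content-Type: application/json' -d '{
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": {
              "max_size": "75gb",
              "max_age": "1d"
            }
          }
        }
      }
    }
  }'
  ```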
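
  The shard-count ceiling itself is the dynamic `cluster.max_shards_per_node` setting; the cluster-wide budget is that value times the number of data nodes. A sketch of checking usage and adjusting the limit (keeping it raised only hides the oversharding):

  ```sh
  # How many shards are open vs. how many data nodes we have.
  curl -s "$ES/_cluster/health?filter_path=active_shards,number_of_data_nodes"

  # The dynamic setting behind the "maximum shards open" errors; 1000 is the
  # Elasticsearch default. Only raise it as a temporary escape hatch.
  curl -XPUT "$ES/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d '{"persistent": {"cluster.max_shards_per_node": 1000}}'
  ```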

- monitoring cluster is overloaded:
  - The size of monitoring data depends on the number of shards in the cluster, so by reducing the number of shards we are also reducing the size of the monitoring data. For this reason, the imbalance on data nodes will improve. The cluster is also completely healthy and operational. All in all, we can revisit this at a later point if the state of the cluster degrades further.
  - the problem is primarily caused by an imbalance in the size of the shards
  - potential fix:
    - stop sending monitoring data from selected (all?) clusters (see the sketch below). Not all of them are critical, and we recently added Prometheus monitoring for Elastic
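
  If we do go that route, and assuming the selected clusters still use the built-in collection rather than Metricbeat, it is a single dynamic setting on each monitored cluster; a sketch:

  ```sh
  # Run against the monitored (source) cluster, not the monitoring cluster.
  # Assumes legacy/internal collection is in use; with Metricbeat-based
  # monitoring, the elasticsearch module would be disabled instead.
  curl -XPUT "$ES/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d '{"persistent": {"xpack.monitoring.collection.enabled": false}}'
  ```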

- cluster state too big, leading to:
  - timeouts on the master; this has a very negative impact on many things, for example on the cluster's ability to recover in case of failures: production#2112 (closed)
  - shard allocation errors: "cannot allocate because information about existing shard data is still being retrieved from some of the nodes"
    - more details here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10094#note_343163664
  - will be significantly improved by reducing the number of shards and the number of fields

- misc other work
  - upgrade the production logging cluster to 7.7 (waiting for the release): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10227
  - remove ML
  - reduce ILM frequency? This might not be such a good idea because some logs are rolled over every 15 minutes. This should only be considered together with an increase in the size threshold: gitlab-com/runbooks!2236 (closed)
  - check if runbooks cover what to do in case the delete step fails: gitlab-com/runbooks!2237 (merged)
  - create an alert for the pending tasks queue growing: gitlab-com/runbooks!2238 (merged)
  - gather more diagnostics (the underlying API calls are listed below)
    - issue for automating diagnostics (hot threads + flamegraph, tasks, pending tasks, cat indices): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10234
    - when we start accumulating logs, see if there's saturation on the master; if there is, open an urgent support case for a heap dump on the master by the support team (this will actually cause a master failover)
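
    For reference, the raw calls behind that list (flame graphs aside), assuming direct API access with `$ES` standing for the cluster URL plus credentials:

    ```sh
    # What the nodes (especially the master) are busy with right now.
    curl -s "$ES/_nodes/hot_threads?threads=5"
    # Currently running tasks, grouped by parent task.
    curl -s "$ES/_tasks?group_by=parents"
    # Cluster-state updates queued up on the master.
    curl -s "$ES/_cluster/pending_tasks"
    # Per-index shard counts, doc counts and sizes.
    curl -s "$ES/_cat/indices?v"
    ```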

  - Why do ILM timeouts occur during snapshot taking?
    - There doesn't seem to be high resource utilization on hot/warm nodes.
    - The snapshots use a dedicated thread pool, so all the other operations should continue normally.
    - Perhaps index creation/deletion is blocked on nodes while they are taking snapshots?
  - What is causing master saturation after index removal (ILM retries)?
    - current primary suspects are the `cluster:admin/snapshot/status` and `internal:index/shard/recovery/start_recovery` tasks
  - CPU metrics in Prometheus are per host, not per cgroup, so Prometheus might be showing CPU utilization at 80% while Elasticsearch might actually be saturated (this might already be available as a metric, it's plotted in the monitoring cluster): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10356
  - send monitoring cluster alerts to Slack
    - The existing watches are system watches that cannot be edited
    - Created https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10332 to add more watches
  - notify Elastic support after the number of shards and the number of fields are reduced so that they can analyze the cluster state

After the cluster is stabilized:

- reduce costs:
  - continue the clean up of logging infra: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8269
  - get rid of the ES5 proxy: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9384
  - investigate what `pubsubbeat-*` is used for and, if possible, remove it. The index dates back to early January; it was created as part of pubsubbeat testing (the pubsubbeat config was not referencing an index). The index was deleted
  - scale back Kibana in the monitoring cluster?
    - Do we have any latency measurements for the monitoring cluster to confirm that increasing the Kibana size didn't help? We do, we're sending monitoring data for the monitoring cluster to the monitoring cluster.
    - estimated savings: $75/month
- reduce the send rate:
  - remove logs for readiness checks: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10066
  - identify other logs that can be safely dropped, reach out to the Scalability team: Jacob for Gitaly, Shawn or Bob for Rails
    - potential sources of the increase:
      - more user traffic
      - application changes
      - static-objects-cache
      - GKE logs
    - potential fixes:
      - lower the sending rate
      - lower the retention period
      - add more warm nodes
  - "json.tracked_items_encoded fields added to our structured logging? They seem to be adding quite a bit of extra log volume, and they're not in a format that's particularly useful to ELK" https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10357