Follow-up actions on the ILM errors in production

[instance-0000000067] took [10.1s], which is over [10s], to compute cluster state update for [cluster_reroute(reroute after starting shards)]
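
To track how often this warning fires and for which tasks, the fields can be pulled out of the log line. Below is a minimal sketch, assuming the message always has the shape shown above with durations in seconds; the regex and the `parse_slow_update` helper are hypothetical and not part of any Elasticsearch tooling.

```python
import re

# Hypothetical pattern matching the warning format quoted above.
# Assumes both durations are reported in seconds (e.g. "10.1s").
SLOW_UPDATE_RE = re.compile(
    r"\[(?P<node>[^\]]+)\] took \[(?P<took>[\d.]+)s\], "
    r"which is over \[(?P<limit>[\d.]+)s\], "
    r"to compute cluster state update for \[(?P<task>.+)\]$"
)


def parse_slow_update(line):
    """Return node, durations and task from a slow cluster state warning,
    or None if the line does not match the assumed format."""
    match = SLOW_UPDATE_RE.search(line.strip())
    if match is None:
        return None
    return {
        "node": match.group("node"),
        "took_s": float(match.group("took")),
        "limit_s": float(match.group("limit")),
        "task": match.group("task"),
    }


if __name__ == "__main__":
    sample = (
        "[instance-0000000067] took [10.1s], which is over [10s], "
        "to compute cluster state update for "
        "[cluster_reroute(reroute after starting shards)]"
    )
    print(parse_slow_update(sample))
```

Lines that do not match return `None`, so the helper can be run over a mixed log stream without extra filtering.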

After the cluster is stabilized:

  • reduce costs:
    • continue cleanup of the logging infra: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8269
    • get rid of ES5 proxy: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9384
    • investigate what pubsubbeat-* is used for and, if possible, remove it. The index dates back to early January; it was created as part of pubsubbeat testing (the pubsubbeat config was not referencing an index). The index was deleted.
    • scale back Kibana in the monitoring cluster?
      • Do we have any latency measurements for the monitoring cluster to confirm that increasing the Kibana size didn't help? We do: monitoring data for the monitoring cluster is sent to the monitoring cluster itself.
      • estimated savings: $75/month
  • reduce send rate:
Edited by Michal Wasilewski