Follow up actions on the ILM errors in production

- continue the conversation with support: https://support.elastic.co/customers/s/case/5004M00000cqitLQAQ
- thread pool saturation
  - reduce the number of mapping updates:
    - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10173
    - add static mappings (see the sketch below)
    - improve the process around making logging schema changes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10353
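
  A minimal sketch of what enforcing static mappings could look like. The template name, index pattern and fields below are assumptions for illustration only, and `$ES` stands for the cluster URL plus credentials. With `dynamic` set to `false`, log fields that are not explicitly mapped no longer trigger mapping updates on the master:

  ```sh
  # Illustrative template - the name, pattern and fields are assumptions,
  # not the real gprd config. $ES = cluster URL + credentials.
  curl -XPUT "$ES/_template/pubsub-rails-inf-gprd" \
    -H 'Content-Type: application/json' -d '{
    "index_patterns": ["pubsub-rails-inf-gprd-*"],
    "mappings": {
      "dynamic": false,
      "properties": {
        "json.severity":   { "type": "keyword" },
        "json.duration_s": { "type": "float" }
      }
    }
  }'
  ```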

- too many fields in indices:
  - reduce the number of fields (a quick way to count fields per index is sketched below):
    - Rails:
      - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9818
      - gitlab-org/gitlab!31910 (merged) - once the static mappings are in place, we can investigate which fields we don't need and how to enforce that (in LabKit etc.)
    - GKE:
      - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9876 - we stopped sending GKE logs to ES, so we should clean up any GKE-related config that is left over
      - re-enable GKE logs for selected fields
  - an example of an error message in the cluster that points to the relationship between cluster state size and timeouts:
    - `[instance-0000000067] took [10.1s], which is over [10s], to compute cluster state update for [cluster_reroute(reroute after starting shards)]`
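
  As a sketch, the field count per index can be pulled from the field capabilities API and compared against the per-index `index.mapping.total_fields.limit` setting. The index name is only an example and `$ES` stands for the cluster URL plus credentials:

  ```sh
  # Rough count of mapped fields in a single index (example index name).
  curl -s "$ES/pubsub-rails-inf-gprd-000001/_field_caps?fields=*" \
    | jq '.fields | length'

  # The per-index cap those fields count against (Elasticsearch default 1000);
  # raising it is a workaround, not a fix.
  curl -s "$ES/pubsub-rails-inf-gprd-000001/_settings?include_defaults=true&filter_path=*.*.index.mapping.total_fields"
  ```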

- too many shards in the cluster - https://www.elastic.co/guide/en/elasticsearch/reference/current/avoid-oversharding.html
  - we're hitting the `max_shards_per_cluster` limit - this was temporarily addressed with: gitlab-com/runbooks!2182 (merged)
  - bring the `max_shards_per_node` limit back down: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10093. Take into consideration the problem with a big number of shards in the cluster and &178 (closed) (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10008#note_334635720)
  - remove the two additional warm nodes added to alleviate memory pressure (not max_shards_per_node); let's wait with this, created a separate issue
  - scale down master nodes; let's revisit this at a later point: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10230
  - lower the frequency of index rollovers: gitlab-com/runbooks!2205 (merged)
  - fix the policy in the log production cluster: gitlab-com/runbooks!2212 (merged)
  - we removed again all `pubsub-rc-rails-*` indices and index templates; we still don't know what these are used for
  - lower the frequency of rollovers further: gitlab-com/runbooks!2215 (merged)
  - potential fixes (two of them are sketched below):
    - as part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10094#note_341928434 we're already rolling over: Tue 22k, Wed 20k, Thu 14k, Fri 8k
    - reindex/reshard
    - close indices (probably not, because we would lose HA)
    - use dedicated ILM policies and index templates for rails, gitaly and workhorse (indices with lower indexing rates don't need 6 shards)
    - lower the sending rate
    - lower the retention period
    - increase the size threshold for index rollover: https://gitlab.com/gitlab-com/runbooks/-/blob/master/elastic/managed-objects/log_gprd/ILM/gitlab-infra-ilm-policy.jsonnet#L8
    - increase the time threshold for rolling over indices: https://gitlab.com/gitlab-com/runbooks/-/blob/master/elastic/managed-objects/log_gprd/ILM/gitlab-infra-ilm-policy.jsonnet#L7
    - adjust the `max_shards_per_node` limit (this is undesirable and will result in "oversharding")
    - add more nodes
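
  For context on the two threshold items above: they live in the `rollover` action of the ILM policy. A sketch of the relevant fragment, with illustrative values only; the source of truth is the linked `gitlab-infra-ilm-policy.jsonnet`, and the policy name here is assumed to match that file:

  ```sh
  # Values are illustrative - the real thresholds are defined in the jsonnet
  # file in the runbooks repo. $ES = cluster URL + credentials.
  curl -XPUT "$ES/_ilm/policy/gitlab-infra-ilm-policy" \
    -H 'Content-Type: application/json' -d '{
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": {
              "max_size": "75gb",
              "max_age": "1d"
            }
          }
        }
      }
    }
  }'
  ```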
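
  The shard-count ceiling itself is the dynamic `cluster.max_shards_per_node` setting; the cluster-wide budget is that value times the number of data nodes. A sketch of checking usage and adjusting the limit (keeping it raised only hides the oversharding):

  ```sh
  # How many shards are open vs. how many data nodes we have.
  curl -s "$ES/_cluster/health?filter_path=active_shards,number_of_data_nodes"

  # The dynamic setting behind the "maximum shards open" errors; 1000 is the
  # Elasticsearch default. Only raise it as a temporary escape hatch.
  curl -XPUT "$ES/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d '{"persistent": {"cluster.max_shards_per_node": 1000}}'
  ```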

- monitoring cluster is overloaded:
  - The size of monitoring data depends on the number of shards in the cluster, so by reducing the number of shards we are also reducing the size of the monitoring data. For this reason, the imbalance on data nodes will improve. The cluster is also completely healthy and operational. All in all, we can revisit this at a later point if the state of the cluster degrades further.
  - the problem is primarily caused by an imbalance in the size of the shards
  - potential fix:
    - stop sending monitoring data from selected (all?) clusters (see the sketch below). Not all of them are critical, and we recently added Prometheus monitoring for Elastic
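
  If we do go that route, and assuming the selected clusters still use the built-in collection rather than Metricbeat, it is a single dynamic setting on each monitored cluster; a sketch:

  ```sh
  # Run against the monitored (source) cluster, not the monitoring cluster.
  # Assumes legacy/internal collection is in use; with Metricbeat-based
  # monitoring, the elasticsearch module would be disabled instead.
  curl -XPUT "$ES/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d '{"persistent": {"xpack.monitoring.collection.enabled": false}}'
  ```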

- cluster state too big, leading to:
  - timeouts on the master; this has a very negative impact on many things, for example on the cluster's ability to recover in case of failures: production#2112 (closed)
  - shard allocation errors: "cannot allocate because information about existing shard data is still being retrieved from some of the nodes"
    - more details here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10094#note_343163664
  - will be significantly improved by reducing the number of shards and the number of fields

- misc other work
  - upgrade the production logging cluster to 7.7 (waiting for the release): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10227
  - remove ML
  - reduce ILM frequency? This might not be such a good idea because some logs are rolled over every 15 minutes. This should only be considered together with an increase in the size threshold: gitlab-com/runbooks!2236 (closed)
  - check if runbooks cover what to do in case the delete step fails: gitlab-com/runbooks!2237 (merged)
  - create an alert for the pending tasks queue growing: gitlab-com/runbooks!2238 (merged)
  - gather more diagnostics (the underlying API calls are listed below)
    - issue for automating diagnostics (hot threads + flamegraph, tasks, pending tasks, cat indices): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10234
    - when we start accumulating logs, see if there's saturation on the master; if there is, open an urgent support case for a heap dump on the master by the support team (this will actually cause a master failover)
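
    For reference, the raw calls behind that list (flame graphs aside), assuming direct API access with `$ES` standing for the cluster URL plus credentials:

    ```sh
    # What the nodes (especially the master) are busy with right now.
    curl -s "$ES/_nodes/hot_threads?threads=5"
    # Currently running tasks, grouped by parent task.
    curl -s "$ES/_tasks?group_by=parents"
    # Cluster-state updates queued up on the master.
    curl -s "$ES/_cluster/pending_tasks"
    # Per-index shard counts, doc counts and sizes.
    curl -s "$ES/_cat/indices?v"
    ```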

  - Why do ILM timeouts occur during snapshot taking?
    - There doesn't seem to be high resource utilization on hot/warm nodes.
    - The snapshots use a dedicated thread pool, so all the other operations should continue normally.
    - Perhaps index creation/deletion is blocked on nodes while they are taking snapshots?
  - What is causing master saturation after index removal (ILM retries)?
    - current primary suspects are the `cluster:admin/snapshot/status` and `internal:index/shard/recovery/start_recovery` tasks
  - CPU metrics in Prometheus are per host, not per cgroup, so Prometheus might be showing CPU utilization at 80% while Elasticsearch might actually be saturated (this might already be available as a metric, it's plotted in the monitoring cluster): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10356
  - send monitoring cluster alerts to Slack
    - The existing watches are system watches that cannot be edited
    - Created https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10332 to add more watches
  - notify Elastic support after the number of shards and the number of fields are reduced so that they can analyze the cluster state

After the cluster is stabilized:

- reduce costs:
  - continue the clean up of logging infra: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8269
  - get rid of the ES5 proxy: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9384
  - investigate what `pubsubbeat-*` is used for and, if possible, remove it. The index dates back to early January; it was created as part of pubsubbeat testing (the pubsubbeat config was not referencing an index). The index was deleted
  - scale back Kibana in the monitoring cluster?
    - Do we have any latency measurements for the monitoring cluster to confirm that increasing the Kibana size didn't help? We do, we're sending monitoring data for the monitoring cluster to the monitoring cluster.
    - estimated savings: $75/month
- reduce the send rate:
  - remove logs for readiness checks: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10066
  - identify other logs that can be safely dropped, reach out to the Scalability team: Jacob for Gitaly, Shawn or Bob for Rails
    - potential sources of the increase:
      - more user traffic
      - application changes
      - static-objects-cache
      - GKE logs
    - potential fixes:
      - lower the sending rate
      - lower the retention period
      - add more warm nodes
  - "json.tracked_items_encoded fields added to our structured logging? They seem to be adding quite a bit of extra log volume, and they're not in a format that's particularly useful to ELK" https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10357