Reduced logging visibility over the weekend (18 & 19 Aug)
Timeline:
- Aug 18th - 16:04 UTC: A spike of UnAcked messages (~600K) on "gitaly-inf-gprd"; it resolved itself a few minutes later
- Aug 19th - 00:00 UTC: UnAcked messages start increasing on multiple pubsubs
- Aug 19th - 01:30 UTC: We took notice and started investigating
- Aug 19th - 01:50 UTC: We saw many `Failed to publish events: 504 Gateway Time-out: <html><body><h1>504 Gateway Time-out</h1>` entries in the pubsub node logs
- Aug 19th - 02:00 UTC: In the Elastic portal, the status of the deployment is red because it can't create snapshots
- Aug 19th - 02:20 UTC: In the Elastic portal, we saw many log entries like `failed to execute pipeline for a bulk request org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.PipelineExecutionService$2@52bdd08a on ...`; on our side, on the pubsub nodes, we saw `Failed to publish events: temporary bulk send failure`
- Aug 19th - 02:24 UTC: Support forums suggest that the thread pool is over capacity
- Aug 19th - 02:30 UTC: We decided to stop pubsub on `pubsub-nginx-inf-gprd` and `pubsub-system-inf-gprd` to reduce the load
- Aug 19th - 02:59 UTC: We are still seeing the same messages (`failed to execute pipeline for a bulk ...`), and the numbers are not going down for the other pubsubs
- Aug 19th - 03:00 UTC: We decided to restart the deployment
- Aug 19th - 04:11 UTC: The restart had not finished after more than an hour, so we decided to cancel it and restart again
- Aug 19th - 04:21 UTC: We noticed a small dip in UnAcked messages after canceling the restart
- Aug 19th - 04:22 UTC: We aborted the second restart as it was still taking too long
- Aug 19th - 04:31 UTC: None of the 12 ES instances are running, probably because of the aborted restart, so we restarted again and let it run its course
- Aug 19th - 06:25 UTC: The restart failed and ES started rolling back
- Aug 19th - 07:00 UTC: The number of UnAcked messages drops to zero
- Aug 19th - 15:23 UTC: The ES deployment turns green
- Aug 19th - 15:50 UTC: The number of UnAcked messages starts rising again
- Aug 20th - 01:55 UTC: The number of UnAcked messages drops to zero
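
The `EsRejectedExecutionException` entries at 02:20 and the forum hint at 02:24 both point at a saturated thread pool. One way to confirm that directly is to ask the cluster which pools have been rejecting work, using Elasticsearch's `_cat/thread_pool` API. The following is a minimal sketch, not something we ran during the incident: the function names and sample rows are hypothetical, `base_url` is a placeholder for the cluster endpoint, authentication is omitted, and pool names (for example `bulk` versus `write`) depend on the Elasticsearch version.

```python
import json
import urllib.request

# Columns requested from the cat thread pool API.
COLUMNS = "node_name,name,active,queue,rejected"


def pools_with_rejections(rows):
    """Return (node, pool, rejected) tuples for pools that rejected work,
    sorted with the largest rejection count first."""
    hits = []
    for row in rows:
        # The cat API returns numeric values as strings in JSON output.
        rejected = int(row.get("rejected") or 0)
        if rejected > 0:
            hits.append((row["node_name"], row["name"], rejected))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


def fetch_thread_pools(base_url):
    """Fetch per-node thread pool stats as a list of dicts.

    base_url is a placeholder (assumption); a hosted deployment would
    also need credentials, which this sketch leaves out.
    """
    url = f"{base_url}/_cat/thread_pool?format=json&h={COLUMNS}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical sample rows for illustration, not data from this incident.
    sample = [
        {"node_name": "node-1", "name": "bulk", "active": "4", "queue": "50", "rejected": "120"},
        {"node_name": "node-1", "name": "search", "active": "0", "queue": "0", "rejected": "0"},
        {"node_name": "node-2", "name": "bulk", "active": "4", "queue": "50", "rejected": "300"},
    ]
    for node, pool, rejected in pools_with_rejections(sample):
        print(f"{node} {pool} rejected={rejected}")
```

The `rejected` counter is cumulative since node start, so comparing two readings taken a few minutes apart shows whether rejections are still happening.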
cc/ @gl-infra