
Reduced logging visibility over the weekend (18 & 19 Aug)

Timeline:

  • Aug 18th - 16:04 UTC: A spike of UnAcked messages on "gitaly-inf-gprd" (~600K); it resolved itself a few minutes later
  • Aug 19th - 00:00 UTC: UnAcked messages on multiple pubsubs start increasing
  • Aug 19th - 01:30 UTC: We noticed and started investigating
  • Aug 19th - 01:50 UTC: We saw many "Failed to publish events: 504 Gateway Time-out: <html><body><h1>504 Gateway Time-out</h1>" entries in the pubsub node logs
  • Aug 19th - 02:00 UTC: In the Elastic portal, the deployment status is red because it can't create snapshots
  • Aug 19th - 02:20 UTC: In the Elastic portal, we saw many log entries like "failed to execute pipeline for a bulk request org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.PipelineExecutionService$2@52bdd08a on ..."; on our side, the pubsub nodes logged "Failed to publish events: temporary bulk send failure"
  • Aug 19th - 02:24 UTC: Support forums suggest that the thread pool is over capacity
  • Aug 19th - 02:30 UTC: We decided to stop pubsub on pubsub-nginx-inf-gprd and pubsub-system-inf-gprd to reduce the load
  • Aug 19th - 02:59 UTC: We are still seeing the same messages ("failed to execute pipeline for a bulk ..."), and the UnAcked message counts for the other pubsubs are not going down
  • Aug 19th - 03:00 UTC: We decided to restart the deployment
  • Aug 19th - 04:11 UTC: The restart had not finished after more than an hour, so we decided to cancel it and restart again
  • Aug 19th - 04:21 UTC: We noticed a small dip in UnAcked messages after canceling the restart
  • Aug 19th - 04:22 UTC: We aborted the second restart because it was also taking a long time
  • Aug 19th - 04:31 UTC: None of the 12 ES instances is running, probably because of the aborted restart, so we restarted again and let it run its course
  • Aug 19th - 06:25 UTC: The restart failed and ES started rolling back
  • Aug 19th - 07:00 UTC: The number of UnAcked messages drops to zero
  • Aug 19th - 15:23 UTC: The ES deployment turns green
  • Aug 19th - 15:50 UTC: The number of UnAcked messages starts rising again
  • Aug 20th - 01:55 UTC: The number of UnAcked messages drops to zero
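
The 02:20 and 02:24 entries point at an over-capacity thread pool. One way to confirm that is to look at the per-node "rejected" counters returned by Elasticsearch's _cat/thread_pool API (for example GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected). The following is a minimal sketch, not what was run during the incident: the helper names, the node names, and every number in SAMPLE are hypothetical, and the pool that handles bulk requests is called "bulk" or "write" depending on the Elasticsearch version.

```python
def parse_thread_pool(cat_output):
    """Parse whitespace-separated _cat/thread_pool text (with a header row).

    Assumes the columns node_name, name, active, queue, rejected were
    requested and that node names contain no spaces.
    """
    lines = [ln for ln in cat_output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        row = dict(zip(header, ln.split()))
        # Convert the numeric columns so they can be compared.
        for key in ("active", "queue", "rejected"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


def pools_with_rejections(rows, min_rejected=1):
    """Return the rows whose thread pool has rejected at least min_rejected tasks."""
    return [r for r in rows if r["rejected"] >= min_rejected]


# Hypothetical sample output; these values are illustrative, not incident data.
SAMPLE = """
node_name  name   active queue rejected
instance-1 write  8      200   15234
instance-1 search 2      0     0
instance-2 write  8      187   9120
"""

if __name__ == "__main__":
    for row in pools_with_rejections(parse_thread_pool(SAMPLE)):
        print(row["node_name"], row["name"], "rejected:", row["rejected"])
```

A rejected count that keeps growing on the bulk/write pool would be consistent with the EsRejectedExecutionException entries seen in the Elastic portal.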

cc/ @gl-infra
