2025-09-11: PubSub messages queuing in pubsub-rails-inf-gprd-sub causing Elastic logging lag

PubSub messages queuing in pubsub-rails-inf-gprd-sub causing Elastic logging lag (Severity 2 (High))

Problem: A backlog of unprocessed messages in the pubsub-rails-inf-gprd-sub subscription delayed Elastic log data processing.

Impact: A backlog of 720 million PubSub messages led to a two-hour delay in log data ingestion, significantly affecting our ability to debug production issues. This impacted internal logging visibility but did not affect customer-facing features.
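
As a rough illustration of the numbers above, the sketch below relates backlog size, delay, and drain time. It assumes the 720 million messages correspond to about two hours of publishing, which this ticket implies but does not state. The consume rate is a hypothetical value chosen only to show the arithmetic, not a measurement from this incident.

```python
def implied_publish_rate(backlog_messages: int, delay_seconds: float) -> float:
    """Average publish rate (msg/s) if the backlog represents `delay_seconds` of traffic."""
    return backlog_messages / delay_seconds


def drain_seconds(backlog_messages: int, publish_rate: float, consume_rate: float):
    """Seconds needed to clear the backlog, or None if consumers do not outpace publishers."""
    surplus = consume_rate - publish_rate
    if surplus <= 0:
        return None  # the backlog would not shrink at these rates
    return backlog_messages / surplus


BACKLOG = 720_000_000            # from the Impact statement
DELAY_SECONDS = 2 * 60 * 60      # two-hour ingestion delay

publish_rate = implied_publish_rate(BACKLOG, DELAY_SECONDS)

# Hypothetical consume rate (msg/s), for illustration only.
HYPOTHETICAL_CONSUME_RATE = 120_000.0
estimate = drain_seconds(BACKLOG, publish_rate, HYPOTHETICAL_CONSUME_RATE)
```

Under these assumptions, a consumer that is only slightly faster than the publisher needs many hours to clear the backlog, which is consistent with slow recovery after ingestion resumes.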

Causes: Long-running ES|QL tasks in Elasticsearch caused worker threads to reach full CPU utilization, creating a processing bottleneck in the log ingestion pipeline.

Response strategy: We identified the problematic long-running ES|QL tasks in Elasticsearch and canceled all of them through the API. CPU usage and worker saturation then returned to normal, and log ingestion resumed. The backlog of PubSub messages is now decreasing, though slowly. We also opened a support ticket with Elastic Cloud for further analysis.
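
The cancellation step can be sketched as follows. This is a minimal illustration, not the procedure actually run during the incident. It assumes the response shape of the Elasticsearch task management API (`GET /_tasks?detailed=true`) and cancels through `POST /_tasks/<task_id>/_cancel`. The `*esql*` action pattern, the 600-second threshold, and the sample task data are assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Assumption: ES|QL task action names contain "esql".
ESQL_ACTION_PATTERN = "*esql*"


def select_long_running(tasks_response, min_running_seconds,
                        action_pattern=ESQL_ACTION_PATTERN):
    """Return IDs of cancellable tasks that match the pattern and exceed the threshold."""
    selected = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            running_seconds = task.get("running_time_in_nanos", 0) / 1e9
            if (fnmatchcase(task.get("action", ""), action_pattern)
                    and task.get("cancellable", False)
                    and running_seconds >= min_running_seconds):
                selected.append(task_id)
    return sorted(selected)


def cancel_request(task_id):
    """Build the HTTP method and path that cancels one task."""
    return ("POST", f"/_tasks/{task_id}/_cancel")


# Hypothetical sample shaped like a GET /_tasks?detailed=true response.
sample = {"nodes": {"nodeA": {"tasks": {
    "nodeA:101": {"action": "indices:data/read/esql", "cancellable": True,
                  "running_time_in_nanos": 900 * 10**9},
    "nodeA:102": {"action": "indices:data/read/esql", "cancellable": True,
                  "running_time_in_nanos": 5 * 10**9},
    "nodeA:103": {"action": "indices:data/write/bulk", "cancellable": False,
                  "running_time_in_nanos": 1200 * 10**9},
}}}}

to_cancel = select_long_running(sample, min_running_seconds=600)
requests_to_send = [cancel_request(task_id) for task_id in to_cancel]
```

Keeping task selection separate from the HTTP call makes it possible to review the list of task IDs before anything is canceled.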


This ticket was created to track INC-3869, by incident.io 🔥