UpdateIndexUsedStorageBytesEventWorker takes 2000+ seconds
## Summary

`Search::Zoekt::UpdateIndexUsedStorageBytesEventWorker` sometimes takes **2000+ seconds** (33+ minutes) to complete a single execution. This causes Sidekiq queueing SLO violations on the `elasticsearch` shard.

**Related incident:** https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/21344

## Steps to reproduce

1. Have a significant number of Zoekt indices with stale `used_storage_bytes_updated_at` (i.e., `last_indexed_at >= used_storage_bytes_updated_at`)
2. The `SchedulingService` triggers `UpdateIndexUsedStorageBytesEvent`
3. The worker processes up to 1000 indices in a single batch
4. Observe execution times of 2000+ seconds

## What is the current *bug* behavior?

The worker takes 2000+ seconds due to an N+1 query pattern. For each of the up to 1000 indices in a batch, it executes **~6 database queries individually**:

1. `SELECT SUM(size_bytes) FROM zoekt_repositories WHERE zoekt_index_id = ?` — per index
2. `SELECT * FROM zoekt_nodes WHERE id = ?` — lazy-loaded per index (no `includes`)
3. `SELECT SUM(reserved_storage_bytes) FROM zoekt_indices WHERE zoekt_node_id = ?` — via `node.unclaimed_storage_bytes` (1st call)
4. Same SUM query again — `node.unclaimed_storage_bytes` called a 2nd time on line 192, and `node.reserved_storage_bytes` is **not memoized**
5. Possible lazy-load of `zoekt_enabled_namespace` during validation
6. `UPDATE zoekt_indices SET ... WHERE id = ?`

**Total: up to ~6,000 queries per worker execution.**

The core loop in the worker:

```ruby
# ee/app/workers/search/zoekt/update_index_used_storage_bytes_event_worker.rb:15-19
def handle_event(_event)
  indices = Index.with_stale_used_storage_bytes_updated_at.ordered_by_used_storage_updated_at
  indices.limit(BATCH_SIZE).each(&:update_storage_bytes_and_watermark_level!)

  reemit_event
end
```

Each `update_storage_bytes_and_watermark_level!` call triggers multiple individual queries:

```ruby
# ee/app/models/search/zoekt/index.rb:110-115
def update_storage_bytes_and_watermark_level!(skip_used_storage_bytes: false)
  refresh_used_storage_bytes unless skip_used_storage_bytes # SUM on zoekt_repositories
  refresh_reserved_storage_bytes # SUM on zoekt_indices (x2, not memoized)
  self.watermark_level = appropriate_watermark_level
  save! # UPDATE + validation queries
end
```

The duplicate SUM query happens here:

```ruby
# ee/app/models/search/zoekt/index.rb:188-196
def claim_reserved_storage_bytes_from_node(ideal_reserved_storage_bytes)
  return reserved_storage_bytes if node.unclaimed_storage_bytes <= 0 # SUM query #1

  max_reservable_storage_bytes = node.unclaimed_storage_bytes + reserved_storage_bytes.to_i # SUM query #2
  [ideal_reserved_storage_bytes, max_reservable_storage_bytes].min
end
```

Where `node.unclaimed_storage_bytes` calls `node.reserved_storage_bytes`, which is:

```ruby
# ee/app/models/search/zoekt/node.rb:246-248
def reserved_storage_bytes
  indices.sum(:reserved_storage_bytes) # Not memoized — fresh DB query each call
end
```

## What is the expected *correct* behavior?

The worker should complete in seconds, not minutes. The N+1 queries should be replaced with batch operations.

## Relevant logs and/or screenshots

![CleanShot_2026-02-20_at_14.11.23_2x](/uploads/c36f9e4be9ac23844b46adaf5962e5dc/CleanShot_2026-02-20_at_14.11.23_2x.png){width=900 height=599}

[source](https://log.gprd.gitlab.net/app/r/s/hvkWu)

## Possible fixes

### 1. Batch the `SUM(size_bytes)` query (biggest win)

Replace 1000 individual SUM queries with a single grouped query:

```ruby
sums = ZoektRepository.where(zoekt_index_id: index_ids).group(:zoekt_index_id).sum(:size_bytes)
```

### 2. Preload associations

```ruby
indices.limit(BATCH_SIZE).includes(:node, :zoekt_enabled_namespace).each(...)
```

### 3. Memoize `node.unclaimed_storage_bytes` (or `node.reserved_storage_bytes`)

The private method `Node#reserved_storage_bytes` runs a fresh `indices.sum(:reserved_storage_bytes)` on every call. It's called twice per index in `claim_reserved_storage_bytes_from_node`. Memoizing it would halve the node-level SUM queries.

### 4. Consider a bulk UPDATE approach

Instead of loading records, mutating in Ruby, and saving one by one, this could be restructured as a single `UPDATE ... FROM (subquery)` that computes `used_storage_bytes` from `zoekt_repositories` in bulk.

### 5. Reduce BATCH_SIZE or add execution time guards

As a short-term mitigation, reducing `BATCH_SIZE` from 1000 or adding a time limit would prevent 2000+ second executions.
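The grouped-query shape in fix 1 can be sketched in plain Ruby as an in-memory stand-in for `ZoektRepository.where(...).group(:zoekt_index_id).sum(:size_bytes)`. The `repositories` rows and the `per_index_sums` / `grouped_sums` helpers below are illustrative, not part of the codebase:

```ruby
# Illustrative rows standing in for zoekt_repositories (not real data).
repositories = [
  { zoekt_index_id: 1, size_bytes: 100 },
  { zoekt_index_id: 1, size_bytes: 250 },
  { zoekt_index_id: 2, size_bytes: 400 }
]

# N+1 shape: one pass over the data ("one query") per index id.
def per_index_sums(repositories, index_ids)
  index_ids.to_h do |id|
    [id, repositories.select { |r| r[:zoekt_index_id] == id }.sum { |r| r[:size_bytes] }]
  end
end

# Batched shape: a single grouped pass, mirroring the semantics of
# ZoektRepository.where(zoekt_index_id: index_ids).group(:zoekt_index_id).sum(:size_bytes)
def grouped_sums(repositories, index_ids)
  repositories
    .select { |r| index_ids.include?(r[:zoekt_index_id]) }
    .group_by { |r| r[:zoekt_index_id] }
    .transform_values { |rows| rows.sum { |r| r[:size_bytes] } }
end

per_index_sums(repositories, [1, 2]) # => {1=>350, 2=>400}
grouped_sums(repositories, [1, 2])   # => {1=>350, 2=>400}
```

One caveat a real fix would need to handle: like SQL `GROUP BY`, the grouped result simply omits indices that have no repository rows, so the worker should read it with a default, e.g. `sums.fetch(index.id, 0)`.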
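The effect of fix 3 can be demonstrated with a minimal counter. `FakeNode`, its `sum_queries` counter, and `memoized_reserved_storage_bytes` are hypothetical names for illustration; only `reserved_storage_bytes` mirrors the real method's shape:

```ruby
# Hypothetical stand-in for Search::Zoekt::Node that counts how many
# times the expensive SUM would hit the database.
class FakeNode
  attr_reader :sum_queries

  def initialize(index_reservations)
    @index_reservations = index_reservations
    @sum_queries = 0
  end

  # Un-memoized, as on the current Node: every call re-runs the SUM.
  def reserved_storage_bytes
    @sum_queries += 1
    @index_reservations.sum
  end

  # Memoized variant: the SUM runs at most once per loaded instance.
  def memoized_reserved_storage_bytes
    @reserved_storage_bytes ||= reserved_storage_bytes
  end
end

plain = FakeNode.new([100, 200])
2.times { plain.reserved_storage_bytes }            # plain.sum_queries == 2

memoized = FakeNode.new([100, 200])
2.times { memoized.memoized_reserved_storage_bytes } # memoized.sum_queries == 1
```

Two things to weigh in the real change: `||=` would re-run the query whenever the sum is 0 (a `defined?`-based guard avoids that), and memoization is per loaded instance, so combined with `includes(:node)` the cached value would persist across every index sharing that node within the batch.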