UpdateIndexUsedStorageBytesEventWorker takes 2000+ seconds
## Summary
`Search::Zoekt::UpdateIndexUsedStorageBytesEventWorker` sometimes takes **2000+ seconds** (33+ minutes) to complete a single execution. This causes Sidekiq queueing SLO violations on the `elasticsearch` shard.
**Related incident:** https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/21344
## Steps to reproduce
1. Have a significant number of Zoekt indices with stale `used_storage_bytes_updated_at` (i.e., `last_indexed_at >= used_storage_bytes_updated_at`)
2. The `SchedulingService` triggers `UpdateIndexUsedStorageBytesEvent`
3. The worker processes up to 1000 indices in a single batch
4. Observe execution times of 2000+ seconds
## What is the current *bug* behavior?
The worker takes 2000+ seconds because of an N+1 query pattern. For each of the up to 1000 indices in a batch, it individually executes **~6 database queries**:
1. `SELECT SUM(size_bytes) FROM zoekt_repositories WHERE zoekt_index_id = ?` — per index
2. `SELECT * FROM zoekt_nodes WHERE id = ?` — lazy-loaded per index (no `includes`)
3. `SELECT SUM(reserved_storage_bytes) FROM zoekt_indices WHERE zoekt_node_id = ?` — via `node.unclaimed_storage_bytes` (1st call)
4. Same SUM query again — `node.unclaimed_storage_bytes` called a 2nd time on line 192, and `node.reserved_storage_bytes` is **not memoized**
5. Possible lazy-load of `zoekt_enabled_namespace` during validation
6. `UPDATE zoekt_indices SET ... WHERE id = ?`
**Total: up to ~6,000 queries per worker execution.**
The core loop in the worker:
```ruby
# ee/app/workers/search/zoekt/update_index_used_storage_bytes_event_worker.rb:15-19
def handle_event(_event)
  indices = Index.with_stale_used_storage_bytes_updated_at.ordered_by_used_storage_updated_at
  indices.limit(BATCH_SIZE).each(&:update_storage_bytes_and_watermark_level!)
  reemit_event
end
```
Each `update_storage_bytes_and_watermark_level!` call triggers multiple individual queries:
```ruby
# ee/app/models/search/zoekt/index.rb:110-115
def update_storage_bytes_and_watermark_level!(skip_used_storage_bytes: false)
  refresh_used_storage_bytes unless skip_used_storage_bytes # SUM on zoekt_repositories
  refresh_reserved_storage_bytes # SUM on zoekt_indices (x2, not memoized)
  self.watermark_level = appropriate_watermark_level
  save! # UPDATE + validation queries
end
```
The duplicate SUM query happens here:
```ruby
# ee/app/models/search/zoekt/index.rb:188-196
def claim_reserved_storage_bytes_from_node(ideal_reserved_storage_bytes)
  return reserved_storage_bytes if node.unclaimed_storage_bytes <= 0 # SUM query #1

  max_reservable_storage_bytes = node.unclaimed_storage_bytes + reserved_storage_bytes.to_i # SUM query #2
  [ideal_reserved_storage_bytes, max_reservable_storage_bytes].min
end
```
Where `node.unclaimed_storage_bytes` calls `node.reserved_storage_bytes` which is:
```ruby
# ee/app/models/search/zoekt/node.rb:246-248
def reserved_storage_bytes
  indices.sum(:reserved_storage_bytes) # Not memoized — fresh DB query each call
end
```
## What is the expected *correct* behavior?
The worker should complete in seconds, not minutes. The N+1 queries should be replaced with batch operations.
## Relevant logs and/or screenshots
[source](https://log.gprd.gitlab.net/app/r/s/hvkWu)
## Possible fixes
### 1. Batch the `SUM(size_bytes)` query (biggest win)
Replace 1000 individual SUM queries with a single grouped query:
```ruby
sums = ZoektRepository.where(zoekt_index_id: index_ids).group(:zoekt_index_id).sum(:size_bytes)
```
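A rough sketch of how the worker loop could consume that grouped result (non-runnable sketch; it reuses the scopes and methods shown earlier in this issue and assumes `used_storage_bytes` is directly assignable, which `refresh_used_storage_bytes` would otherwise populate):

```ruby
# Sketch only — one grouped SUM replaces up to BATCH_SIZE individual ones.
def handle_event(_event)
  indices = Index.with_stale_used_storage_bytes_updated_at
                 .ordered_by_used_storage_updated_at
                 .limit(BATCH_SIZE)
                 .to_a

  sums = ZoektRepository.where(zoekt_index_id: indices.map(&:id))
                        .group(:zoekt_index_id)
                        .sum(:size_bytes)

  indices.each do |index|
    index.used_storage_bytes = sums.fetch(index.id, 0)
    index.update_storage_bytes_and_watermark_level!(skip_used_storage_bytes: true)
  end

  reemit_event
end
```

`sums` is a hash of `zoekt_index_id => SUM(size_bytes)`, so indices with no repositories fall back to `0` via `fetch`.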
### 2. Preload associations
```ruby
indices.limit(BATCH_SIZE).includes(:node, :zoekt_enabled_namespace).each(...)
```
### 3. Memoize `node.unclaimed_storage_bytes` (or `node.reserved_storage_bytes`)
The private method `Node#reserved_storage_bytes` runs a fresh `indices.sum(:reserved_storage_bytes)` on every call. It's called twice per index in `claim_reserved_storage_bytes_from_node`. Memoizing it would halve the node-level SUM queries.
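A minimal, self-contained sketch of the memoization idea — `FakeNode` is an illustrative plain-Ruby stand-in for the real `Node` model, with a counter in place of the actual SQL query:

```ruby
# Illustrative stand-in for Node: counts how often the "SUM query" runs.
class FakeNode
  attr_reader :sum_query_count

  def initialize(index_reservations)
    @index_reservations = index_reservations
    @sum_query_count = 0
  end

  # Memoized: the expensive aggregate runs at most once per instance.
  # (||= is safe here because the sum is an Integer, never nil or false.)
  def reserved_storage_bytes
    @reserved_storage_bytes ||= begin
      @sum_query_count += 1 # stands in for indices.sum(:reserved_storage_bytes)
      @index_reservations.sum
    end
  end
end

node = FakeNode.new([100, 200, 300])
node.reserved_storage_bytes # => 600 (runs the aggregate)
node.reserved_storage_bytes # => 600 (cached, no second query)
node.sum_query_count        # => 1
```

One trade-off to weigh: a memoized value can go stale if reservations change mid-batch, which may or may not be acceptable for this worker.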
### 4. Consider a bulk UPDATE approach
Instead of loading records, mutating in Ruby, and saving one by one, this could be restructured as a single `UPDATE ... FROM (subquery)` that computes `used_storage_bytes` from `zoekt_repositories` in bulk.
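A rough shape of that statement (hypothetical — table and column names are taken from the queries above, `stale_index_ids` is an assumed id list from the stale scope, and the watermark-level recalculation would still need separate handling):

```ruby
# Sketch only: bulk-update used_storage_bytes for a batch of index ids.
# In real code the id list should be bound/sanitized, not interpolated.
Index.connection.execute(<<~SQL)
  UPDATE zoekt_indices
  SET used_storage_bytes = agg.total_bytes
  FROM (
    SELECT zoekt_index_id, SUM(size_bytes) AS total_bytes
    FROM zoekt_repositories
    GROUP BY zoekt_index_id
  ) agg
  WHERE zoekt_indices.id = agg.zoekt_index_id
    AND zoekt_indices.id IN (#{stale_index_ids.join(',')})
SQL
```

This collapses the per-index SELECT/UPDATE pairs into one round trip, at the cost of skipping ActiveRecord validations and callbacks.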
### 5. Reduce BATCH_SIZE or add execution time guards
As a short-term mitigation, reducing `BATCH_SIZE` from 1000 or adding a time limit would prevent 2000+ second executions.
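As a sketch of the time-guard idea (plain Ruby with a monotonic clock — not GitLab's actual limiter; `process_with_deadline` is a made-up helper):

```ruby
# Sketch: stop a batch once a deadline passes; the worker's re-emitted
# event would pick up the remaining stale indices on the next run.
def process_with_deadline(items, max_runtime_seconds:)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + max_runtime_seconds
  processed = 0

  items.each do |item|
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    yield item
    processed += 1
  end

  processed # caller compares this to items.size to decide whether work remains
end
```

In the worker this would wrap the `indices.each` loop, with `reemit_event` still scheduling the remainder.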