UpdateIndexUsedStorageBytesEventWorker takes 2000+ seconds
Summary
Search::Zoekt::UpdateIndexUsedStorageBytesEventWorker sometimes takes 2000+ seconds (33+ minutes) to complete a single execution. This causes Sidekiq queueing SLO violations on the elasticsearch shard.
Related incident: gitlab-com/gl-infra/production#21344
Steps to reproduce
- Have a significant number of Zoekt indices with stale `used_storage_bytes_updated_at` (i.e., `last_indexed_at >= used_storage_bytes_updated_at`)
- The `SchedulingService` triggers `UpdateIndexUsedStorageBytesEvent`
- The worker processes up to 1000 indices in a single batch
- Observe execution times of 2000+ seconds
What is the current bug behavior?
The worker takes 2000+ seconds due to an N+1 query pattern: for each of the up to 1000 indices in a batch, it issues roughly six individual database queries:
- `SELECT SUM(size_bytes) FROM zoekt_repositories WHERE zoekt_index_id = ?` — per index
- `SELECT * FROM zoekt_nodes WHERE id = ?` — lazy-loaded per index (no `includes`)
- `SELECT SUM(reserved_storage_bytes) FROM zoekt_indices WHERE zoekt_node_id = ?` — via `node.unclaimed_storage_bytes` (1st call)
- Same SUM query again — `node.unclaimed_storage_bytes` is called a 2nd time on line 192, and `node.reserved_storage_bytes` is not memoized
- Possible lazy-load of `zoekt_enabled_namespace` during validation
- `UPDATE zoekt_indices SET ... WHERE id = ?`

Total: up to ~6,000 queries per worker execution.
The core loop in the worker:
# ee/app/workers/search/zoekt/update_index_used_storage_bytes_event_worker.rb:15-19
def handle_event(_event)
  indices = Index.with_stale_used_storage_bytes_updated_at.ordered_by_used_storage_updated_at
  indices.limit(BATCH_SIZE).each(&:update_storage_bytes_and_watermark_level!)
  reemit_event
end
Each update_storage_bytes_and_watermark_level! call triggers multiple individual queries:
# ee/app/models/search/zoekt/index.rb:110-115
def update_storage_bytes_and_watermark_level!(skip_used_storage_bytes: false)
  refresh_used_storage_bytes unless skip_used_storage_bytes # SUM on zoekt_repositories
  refresh_reserved_storage_bytes # SUM on zoekt_indices (x2, not memoized)
  self.watermark_level = appropriate_watermark_level
  save! # UPDATE + validation queries
end
The duplicate SUM query happens here:
# ee/app/models/search/zoekt/index.rb:188-196
def claim_reserved_storage_bytes_from_node(ideal_reserved_storage_bytes)
  return reserved_storage_bytes if node.unclaimed_storage_bytes <= 0 # SUM query #1

  max_reservable_storage_bytes = node.unclaimed_storage_bytes + reserved_storage_bytes.to_i # SUM query #2
  [ideal_reserved_storage_bytes, max_reservable_storage_bytes].min
end
Where node.unclaimed_storage_bytes calls node.reserved_storage_bytes which is:
# ee/app/models/search/zoekt/node.rb:246-248
def reserved_storage_bytes
  indices.sum(:reserved_storage_bytes) # Not memoized — fresh DB query each call
end
What is the expected correct behavior?
The worker should complete in seconds, not minutes. The N+1 queries should be replaced with batch operations.
Possible fixes
1. Batch the SUM(size_bytes) query (biggest win)
Replace 1000 individual SUM queries with a single grouped query:
sums = ZoektRepository.where(zoekt_index_id: index_ids).group(:zoekt_index_id).sum(:size_bytes)
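The grouped query returns a hash keyed by `zoekt_index_id`. A minimal sketch (plain Ruby, no database; `Index` is a stub struct standing in for the ActiveRecord model) of how the worker loop could consume that hash, combined with the existing `skip_used_storage_bytes:` flag so `update_storage_bytes_and_watermark_level!` no longer issues its own per-index SUM:

```ruby
# Stand-in for the result of:
#   ZoektRepository.where(zoekt_index_id: index_ids).group(:zoekt_index_id).sum(:size_bytes)
repos = [
  { zoekt_index_id: 1, size_bytes: 100 },
  { zoekt_index_id: 1, size_bytes: 250 },
  { zoekt_index_id: 2, size_bytes: 40 }
]
sums = Hash.new(0)
repos.each { |r| sums[r[:zoekt_index_id]] += r[:size_bytes] }

# Stub: in the real worker these are Search::Zoekt::Index records.
Index = Struct.new(:id, :used_storage_bytes)
indices = [Index.new(1), Index.new(2), Index.new(3)]

indices.each do |index|
  # fetch(id, 0): indices with no repositories sum to zero.
  index.used_storage_bytes = sums.fetch(index.id, 0)
  # then, in the real code:
  # index.update_storage_bytes_and_watermark_level!(skip_used_storage_bytes: true)
end
```

This is a sketch under the assumption that assigning `used_storage_bytes` before calling the bang method is compatible with what `refresh_used_storage_bytes` currently does; the actual wiring would need to match the model.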
2. Preload associations
indices.limit(BATCH_SIZE).includes(:node, :zoekt_enabled_namespace).each(...)
3. Memoize node.unclaimed_storage_bytes (or node.reserved_storage_bytes)
The private method Node#reserved_storage_bytes runs a fresh indices.sum(:reserved_storage_bytes) on every call. It's called twice per index in claim_reserved_storage_bytes_from_node. Memoizing it would halve the node-level SUM queries.
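A standalone sketch of the memoization (class and counter are stubs for illustration, not the real `Node` model; GitLab code would more likely use its `StrongMemoize` utility than a bare `||=`):

```ruby
# Stub Node: sum_from_db stands in for indices.sum(:reserved_storage_bytes).
class Node
  attr_reader :query_count

  def initialize
    @query_count = 0
  end

  def reserved_storage_bytes
    # ||= caches the first result for this instance's lifetime, so the
    # second call in claim_reserved_storage_bytes_from_node skips the DB.
    @reserved_storage_bytes ||= sum_from_db
  end

  private

  def sum_from_db
    @query_count += 1 # count simulated DB hits
    1_000
  end
end

node = Node.new
node.reserved_storage_bytes
node.reserved_storage_bytes
node.query_count # => 1
```

Since the sum is always an Integer (never `nil`/`false`), the usual `||=` pitfall with falsy values does not apply here; the main caveat is that a memoized value goes stale if indices change within the same object's lifetime.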
4. Consider a bulk UPDATE approach
Instead of loading records, mutating in Ruby, and saving one by one, this could be restructured as a single UPDATE ... FROM (subquery) that computes used_storage_bytes from zoekt_repositories in bulk.
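A hedged sketch of what such a statement could look like (PostgreSQL `UPDATE ... FROM` syntax; table and column names taken from this issue, the rest assumed — in real code the ids would go through sanitized binds, and `watermark_level` would still need Ruby-side recalculation since `appropriate_watermark_level` lives in the model):

```ruby
batch_ids = [1, 2, 3] # illustrative batch of index ids

sql = <<~SQL
  UPDATE zoekt_indices
  SET used_storage_bytes = sums.total,
      used_storage_bytes_updated_at = NOW()
  FROM (
    SELECT zoekt_index_id, SUM(size_bytes) AS total
    FROM zoekt_repositories
    WHERE zoekt_index_id IN (#{batch_ids.join(', ')})
    GROUP BY zoekt_index_id
  ) AS sums
  WHERE zoekt_indices.id = sums.zoekt_index_id
SQL
```

One statement replaces up to 1000 SELECT-mutate-UPDATE round trips, at the cost of bypassing model validations and callbacks, which is why this needs more care than the preload/batch-SUM fixes.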
5. Reduce BATCH_SIZE or add execution time guards
As a short-term mitigation, reducing BATCH_SIZE from 1000 or adding a time limit would prevent 2000+ second executions.
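A standalone sketch of a deadline guard (work items stubbed as plain values; the GitLab codebase also has runtime-limiter helpers for workers, but a monotonic-clock check shows the idea): stop processing once the budget is spent and let the re-emitted event's next run pick up the remainder.

```ruby
# Process items until a wall-clock budget is exhausted. The monotonic
# clock is immune to NTP adjustments, unlike Time.now.
def process_with_deadline(items, max_runtime_seconds)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + max_runtime_seconds
  processed = []
  items.each do |item|
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    processed << item # stands in for update_storage_bytes_and_watermark_level!
  end
  processed
end

process_with_deadline([1, 2, 3], 5)
```

Because the worker orders indices by `used_storage_bytes_updated_at`, unprocessed indices naturally sort to the front of the next batch, so stopping early loses no work permanently.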
