Fix N+1 queries in UpdateIndexUsedStorageBytesEventWorker
## What does this MR do and why?
Fixes N+1 queries in `UpdateIndexUsedStorageBytesEventWorker` that caused 2000+ second execution times (33+ minutes), violating Sidekiq queueing SLOs on the elasticsearch shard.
Related incident: gitlab-com/gl-infra/production#21344
### Root cause
The worker processes up to 1000 indices per batch. For each index, it executed ~5 individual database queries:
- `SUM(size_bytes)` from `zoekt_repositories` per index (1000x)
- `SELECT *` from `zoekt_nodes` — lazy loaded per index (1000x)
- `SUM(reserved_storage_bytes)` from `zoekt_indices` per node — called twice per index because `Node#reserved_storage_bytes` is not memoized (2000x)
- `UPDATE zoekt_indices` per index (1000x)
Total: ~5000 queries per batch execution.
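The query-count arithmetic can be sketched in plain Ruby (the hash keys below are illustrative labels for the queries listed above, not real method names):

```ruby
# Query-count arithmetic for the old per-index path, using the
# counts from the description (labels are illustrative).
batch_size = 1000

old_path = {
  fetch_stale_indices: 1,               # one batched SELECT
  sum_repos_per_index: batch_size,      # SUM(size_bytes) per index
  lazy_load_node:      batch_size,      # SELECT * FROM zoekt_nodes per index
  sum_reserved:        batch_size * 2,  # not memoized, so called twice per index
  update_index:        batch_size       # UPDATE zoekt_indices per index
}

puts old_path.values.sum # => 5001
```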
### Fix
Replace per-index queries with batch operations, gated behind the `zoekt_batch_update_index_storage_bytes` feature flag:

- Eager load node associations via `.preload_node`
- Batch fetch repository size sums with a single grouped query (`Repository.sum_size_bytes_by_index`)
- Batch fetch node reserved bytes with a single grouped query (`Index.sum_reserved_storage_bytes_by_node`), then maintain correctness via running delta tracking after each index `save!`
The delta tracking is important: when multiple indices share the same node, updating one index's reserved_storage_bytes changes the node's unclaimed_storage_bytes. Without tracking the delta, subsequent indices on the same node would see stale values and could over-allocate storage.
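The delta-tracking idea can be illustrated with plain hashes standing in for the ActiveRecord models (all names and numbers below are hypothetical):

```ruby
# Plain-Ruby sketch of the running delta tracking. Hashes stand in for
# the Index/Node models; capacities and sizes are made-up numbers.

node_capacity = 150
reserved_by_node = { 1 => 100 } # from the single grouped query

# Two indices on the same node, each about to grow.
indices = [
  { node_id: 1, reserved_storage_bytes: 60, new_reserved: 80 },
  { node_id: 1, reserved_storage_bytes: 40, new_reserved: 50 }
]

indices.each do |index|
  unclaimed = node_capacity - reserved_by_node[index[:node_id]]
  delta = index[:new_reserved] - index[:reserved_storage_bytes]

  # Without the running delta, the second index would still see
  # unclaimed == 50 and could over-allocate the node.
  raise 'over-allocation' if delta > unclaimed

  index[:reserved_storage_bytes] = index[:new_reserved] # the `save!`
  reserved_by_node[index[:node_id]] += delta            # running delta
end

puts reserved_by_node[1] # => 130
```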
Total after fix: ~4 read queries + N update queries (~1004 total).
## Database queries
### Query count summary
| Query | Before | After |
|---|---|---|
| Fetch stale indices | 1 | 1 |
| Load nodes | 1000 (lazy) | 1 (eager) |
| SUM repos per index | 1000 | 1 (grouped) |
| SUM reserved per node | 2000 | 1 (grouped + cached) |
| UPDATE per index | 1000 | 1000 |
| Total | ~5001 | ~1004 |
### Queries for postgres.ai validation
These queries are self-contained and can be run directly in postgres.ai.
**Query 1 — Fetch stale indices (unchanged, 1x):**

```sql
SELECT "zoekt_indices".*
FROM "zoekt_indices"
WHERE (last_indexed_at >= used_storage_bytes_updated_at)
ORDER BY "zoekt_indices"."used_storage_bytes_updated_at" ASC
LIMIT 1000
```
https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146589
**Query 2 — Eager load nodes (new, 1x — replaces 1000 lazy loads):**

```sql
SELECT "zoekt_nodes".*
FROM "zoekt_nodes"
WHERE "zoekt_nodes"."id" IN (
  SELECT DISTINCT zoekt_node_id FROM (
    SELECT zoekt_node_id FROM zoekt_indices
    WHERE last_indexed_at >= used_storage_bytes_updated_at
    ORDER BY used_storage_bytes_updated_at ASC
    LIMIT 1000
  ) sub
)
```
https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146590
**Query 3 — Batch SUM repository size_bytes (new, 1x — replaces 1000 per-index SUMs):**

```sql
SELECT SUM("zoekt_repositories"."size_bytes") AS sum_size_bytes,
       "zoekt_repositories"."zoekt_index_id"
FROM "zoekt_repositories"
WHERE "zoekt_repositories"."zoekt_index_id" IN (
  SELECT id FROM zoekt_indices
  WHERE last_indexed_at >= used_storage_bytes_updated_at
  ORDER BY used_storage_bytes_updated_at ASC
  LIMIT 1000
)
GROUP BY "zoekt_repositories"."zoekt_index_id"
```
https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146593
**Query 4 — Batch SUM reserved_storage_bytes per node (new, 1x — replaces 2000 per-node SUMs):**

```sql
SELECT SUM("zoekt_indices"."reserved_storage_bytes") AS sum_reserved_storage_bytes,
       "zoekt_indices"."zoekt_node_id"
FROM "zoekt_indices"
WHERE "zoekt_indices"."zoekt_node_id" IN (
  SELECT DISTINCT zoekt_node_id FROM (
    SELECT zoekt_node_id FROM zoekt_indices
    WHERE last_indexed_at >= used_storage_bytes_updated_at
    ORDER BY used_storage_bytes_updated_at ASC
    LIMIT 1000
  ) sub
)
GROUP BY "zoekt_indices"."zoekt_node_id"
```
https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146594
**Old query A — Per-index SUM repos (was called 1000x):**

```sql
SELECT SUM("zoekt_repositories"."size_bytes")
FROM "zoekt_repositories"
WHERE "zoekt_repositories"."zoekt_index_id" = (
  SELECT id FROM zoekt_indices
  WHERE last_indexed_at >= used_storage_bytes_updated_at
  ORDER BY used_storage_bytes_updated_at ASC
  LIMIT 1
)
```
https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146595
**Old query B — Per-node SUM reserved (was called 2000x):**

```sql
SELECT SUM("zoekt_indices"."reserved_storage_bytes")
FROM "zoekt_indices"
WHERE "zoekt_indices"."zoekt_node_id" = (
  SELECT zoekt_node_id FROM zoekt_indices
  WHERE last_indexed_at >= used_storage_bytes_updated_at
  ORDER BY used_storage_bytes_updated_at ASC
  LIMIT 1
)
```
https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146596
### Production benchmark (Rails console, read-only)

Ran a read-only benchmark on the production Rails console with 441 stale indices across 36 nodes:
| Query | OLD | NEW | Improvement |
|---|---|---|---|
| SUM repos per index | 38.41s (441×) | 1.54s (1 batch) | 25× |
| Load nodes | 14.42s (441× lazy) | 63.5ms (1 eager) | 227× |
| SUM reserved per node | 35.11s (882×) | 172.7ms (1 batch) | 203× |
| Total (read queries) | 87.94s | 1.78s | ~49× faster |
Note: this was with only 441 stale indices. At the full 1000-index batch size, the old path would exceed ~200s of read queries alone — confirming the root cause of the 2000+ second production incidents.
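The "~200s" figure is a linear extrapolation of the measured read-query time, assuming query time scales with index count:

```ruby
# Linear extrapolation of the 441-index read-query benchmark
# to the full 1000-index batch size.
measured_seconds = 87.94
measured_indices = 441
batch_size = 1000

puts (measured_seconds * batch_size / measured_indices).round # => 199
```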
### Production replica validation (postgres.ai)

Validated against a production clone: 2,938 stale indices across 36 nodes and several million repository rows.
New batch queries:
| Query | Execution Time | Plan |
|---|---|---|
| Fetch stale indices (LIMIT 1000) | 1.0ms | Index Scan on `idx_zoekt_last_indexed_at_gt_used_storage_bytes_updated_at` |
| Eager load 36 nodes | 1.4ms | Hash Join on 36-row `zoekt_nodes` |
| Batch SUM repos (1000 indices, 353K rows) | 123ms | Index-Only Scan on `idx_zoekt_repositories_on_zoekt_index_id_and_size_bytes` |
| Batch SUM reserved (36 nodes, 59K rows) | 46ms | Index Scan on `index_zoekt_indices_on_zoekt_node_id_and_id` |
Old per-index queries (for comparison):
| Query | Per-call time | Calls | Total estimated time |
|---|---|---|---|
| SUM repos per index | ~0.15ms | 1000 | ~150ms |
| Load node (lazy) | ~0.5ms | 1000 | ~500ms |
| SUM reserved per node | ~3.8ms (warm) / ~255ms (cold) | 2000 | ~7.6s (warm) / ~510s (cold) |
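The per-call numbers above multiply out as follows (warm vs cold cache for the non-memoized reserved-bytes SUM):

```ruby
# Arithmetic behind the old-path totals in the table above.
calls = 2000     # SUM reserved per node: called twice per index
warm_ms = 3.8
cold_ms = 255.0

puts (calls * warm_ms / 1000).round(1) # => 7.6  (seconds, warm cache)
puts (calls * cold_ms / 1000).round    # => 510  (seconds, cold cache)
```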
## References
- Closes #590932
- Related incident: gitlab-com/gl-infra/production#21344
- Feature flag: `zoekt_batch_update_index_storage_bytes`
- Rollout issue: #590966
## How to set up and validate locally

- Enable the feature flag:

  ```ruby
  Feature.enable(:zoekt_batch_update_index_storage_bytes)
  ```

- Make some indices stale:

  ```ruby
  Search::Zoekt::Index.limit(5).update_all(
    used_storage_bytes_updated_at: 1.day.ago,
    last_indexed_at: Time.current
  )
  ```

- Trigger the worker and observe the query log:

  ```ruby
  ActiveRecord::Base.logger = Logger.new(STDOUT)
  event = Search::Zoekt::UpdateIndexUsedStorageBytesEvent.new(data: {})
  Search::Zoekt::UpdateIndexUsedStorageBytesEventWorker.new.perform(event.class.name, event.data)
  ```

- Verify that batch queries appear in the log instead of per-index queries.
## MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.