Fix N+1 queries in UpdateIndexUsedStorageBytesEventWorker

What does this MR do and why?

Fixes N+1 queries in UpdateIndexUsedStorageBytesEventWorker that caused 2000+ second execution times (33+ minutes), violating Sidekiq queueing SLOs on the elasticsearch shard.

Related incident: gitlab-com/gl-infra/production#21344

Root cause

The worker processes up to 1000 indices per batch. For each index, it executed ~5 individual database queries:

  1. SUM(size_bytes) from zoekt_repositories per index (1000x)
  2. SELECT * from zoekt_nodes — lazy loaded per index (1000x)
  3. SUM(reserved_storage_bytes) from zoekt_indices per node — called twice per index because Node#reserved_storage_bytes is not memoized (2000x)
  4. UPDATE zoekt_indices per index (1000x)

Total: ~5000 queries per batch execution.

Fix

Replace per-index queries with batch operations, gated behind zoekt_batch_update_index_storage_bytes feature flag:

  • Eager load node associations via .preload_node
  • Batch fetch repository size sums with a single grouped query (Repository.sum_size_bytes_by_index)
  • Batch fetch node reserved bytes with a single grouped query (Index.sum_reserved_storage_bytes_by_node), then maintain correctness via running delta tracking after each index save!

The delta tracking is important: when multiple indices share the same node, updating one index's reserved_storage_bytes changes the node's unclaimed_storage_bytes. Without tracking the delta, subsequent indices on the same node would see stale values and could over-allocate storage.

Total after fix: ~4 read queries + N update queries (~1004 total).

Database queries

Query count summary

Query Before After
Fetch stale indices 1 1
Load nodes 1000 (lazy) 1 (eager)
SUM repos per index 1000 1 (grouped)
SUM reserved per node 2000 1 (grouped + cached)
UPDATE per index 1000 1000
Total ~5001 ~1004

Queries for postgres.ai validation

These queries are self-contained and can be run directly in postgres.ai.

Query 1 — Fetch stale indices (unchanged, 1x):

SELECT "zoekt_indices".*
FROM "zoekt_indices"
WHERE (last_indexed_at >= used_storage_bytes_updated_at)
ORDER BY "zoekt_indices"."used_storage_bytes_updated_at" ASC
LIMIT 1000

https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146589

Query 2 — Eager load nodes (new, 1x — replaces 1000 lazy loads):

SELECT "zoekt_nodes".*
FROM "zoekt_nodes"
WHERE "zoekt_nodes"."id" IN (
  SELECT DISTINCT zoekt_node_id FROM (
    SELECT zoekt_node_id FROM zoekt_indices
    WHERE last_indexed_at >= used_storage_bytes_updated_at
    ORDER BY used_storage_bytes_updated_at ASC
    LIMIT 1000
  ) sub
)

https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146590

Query 3 — Batch SUM repository size_bytes (new, 1x — replaces 1000 per-index SUMs):

SELECT SUM("zoekt_repositories"."size_bytes") AS sum_size_bytes,
  "zoekt_repositories"."zoekt_index_id"
FROM "zoekt_repositories"
WHERE "zoekt_repositories"."zoekt_index_id" IN (
  SELECT id FROM zoekt_indices
  WHERE last_indexed_at >= used_storage_bytes_updated_at
  ORDER BY used_storage_bytes_updated_at ASC
  LIMIT 1000
)
GROUP BY "zoekt_repositories"."zoekt_index_id"

https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146593

Query 4 — Batch SUM reserved_storage_bytes per node (new, 1x — replaces 2000 per-node SUMs):

SELECT SUM("zoekt_indices"."reserved_storage_bytes") AS sum_reserved_storage_bytes,
  "zoekt_indices"."zoekt_node_id"
FROM "zoekt_indices"
WHERE "zoekt_indices"."zoekt_node_id" IN (
  SELECT DISTINCT zoekt_node_id FROM (
    SELECT zoekt_node_id FROM zoekt_indices
    WHERE last_indexed_at >= used_storage_bytes_updated_at
    ORDER BY used_storage_bytes_updated_at ASC
    LIMIT 1000
  ) sub
)
GROUP BY "zoekt_indices"."zoekt_node_id"

https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146594

Old query A — Per-index SUM repos (was called 1000x):

SELECT SUM("zoekt_repositories"."size_bytes")
FROM "zoekt_repositories"
WHERE "zoekt_repositories"."zoekt_index_id" = (
  SELECT id FROM zoekt_indices
  WHERE last_indexed_at >= used_storage_bytes_updated_at
  ORDER BY used_storage_bytes_updated_at ASC
  LIMIT 1
)

https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146595

Old query B — Per-node SUM reserved (was called 2000x):

SELECT SUM("zoekt_indices"."reserved_storage_bytes")
FROM "zoekt_indices"
WHERE "zoekt_indices"."zoekt_node_id" = (
  SELECT zoekt_node_id FROM zoekt_indices
  WHERE last_indexed_at >= used_storage_bytes_updated_at
  ORDER BY used_storage_bytes_updated_at ASC
  LIMIT 1
)

https://postgres.ai/console/gitlab/gitlab-production-main/sessions/49026/commands/146596

Production benchmark (rails console, read-only)

Ran a read-only benchmark on the production rails console with 441 stale indices across 36 nodes:

Query OLD NEW Improvement
SUM repos per index 38.41s (441×) 1.54s (1 batch) 25×
Load nodes 14.42s (441× lazy) 63.5ms (1 eager) 227×
SUM reserved per node 35.11s (882×) 172.7ms (1 batch) 203×
Total (read queries) 87.94s 1.78s ~49× faster

Note: this was with only 441 stale indices. At the full 1000-index batch size, the old path would exceed ~200s of read queries alone — confirming the root cause of the 2000+ second production incidents.

Production replica validation (postgres.ai)

Validated against production clone: 2,938 stale indices across 36 nodes, several M repos.

New batch queries:

Query Execution Time Plan
Fetch stale indices (LIMIT 1000) 1.0ms Index Scan on idx_zoekt_last_indexed_at_gt_used_storage_bytes_updated_at
Eager load 36 nodes 1.4ms Hash Join on 36-row zoekt_nodes
Batch SUM repos (1000 indices, 353K rows) 123ms Index-Only Scan on idx_zoekt_repositories_on_zoekt_index_id_and_size_bytes
Batch SUM reserved (36 nodes, 59K rows) 46ms Index Scan on index_zoekt_indices_on_zoekt_node_id_and_id

Old per-index queries (for comparison):

Query Per-call time Calls Total estimated time
SUM repos per index ~0.15ms 1000 ~150ms
Load node (lazy) ~0.5ms 1000 ~500ms
SUM reserved per node ~3.8ms (warm) / ~255ms (cold) 2000 ~7.6s (warm) / ~510s (cold)

References

How to set up and validate locally

  1. Enable the feature flag:
    Feature.enable(:zoekt_batch_update_index_storage_bytes)
  2. Make some indices stale:
    Search::Zoekt::Index.limit(5).update_all(
      used_storage_bytes_updated_at: 1.day.ago,
      last_indexed_at: Time.current
    )
  3. Trigger the worker and observe the query log:
    ActiveRecord::Base.logger = Logger.new(STDOUT)
    event = Search::Zoekt::UpdateIndexUsedStorageBytesEvent.new(data: {})
    Search::Zoekt::UpdateIndexUsedStorageBytesEventWorker.new.perform(event.class.name, event.data)
  4. Verify batch queries instead of per-index queries in the log.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Dmitry Gruzd

Merge request reports

Loading