Elasticsearch bulk indexer can write stale data when reading from a lagged replica

Summary

ElasticIndexBulkCronWorker can index a stale version of a record into the advanced search index when the replica it reads from has not yet applied the UPDATE that triggered the enqueue. Once the stale value is written, the Redis ZSET ref is removed and no retry occurs, so the index stays stale until the next write to that record.

Steps to reproduce

Hard to reproduce on demand (depends on replica lag at the moment the bulk cron tick fires). Observed in production on work item gitlab-org/gitlab#504460 (database_id 157334722):

1. User changes milestone_id on the work item.
2. after_commit :maintain_elasticsearch_update fires on the writer's connection (primary), calling Elastic::ProcessBookkeepingService.track! which does a Redis ZADD.
3. ElasticIndexBulkCronWorker runs shortly after, reads the record via WorkItem.id_in(ids) from a replica.
4. If the replica has not yet applied the UPDATE from step 1, the indexer reads the prior version of the row.
5. The stale row is written to Elasticsearch and the Redis ref is removed via zremrangebyscore. No retry.
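
The sequence above can be sketched as a toy in-memory model. This is only an illustration of the race, not GitLab code: `Replica`, `Tracker`, and `bulk_index` are hypothetical names, and the replica "lag" is simulated by applying writes only when `catch_up!` is called.

```ruby
# Toy model of the race. All names here are hypothetical stand-ins.

# A replica that applies replicated writes only when catch_up! is called.
class Replica
  def initialize(rows)
    @rows = rows.transform_values(&:dup)
    @pending = []
  end

  def replicate(id, attrs)
    @pending << [id, attrs]
  end

  def catch_up!
    @pending.each { |id, attrs| @rows[id] = @rows.fetch(id).merge(attrs) }
    @pending.clear
  end

  def find(id)
    @rows.fetch(id)
  end
end

# Stand-in for the Redis ZSET: refs are deduplicated, and draining removes them.
class Tracker
  def initialize
    @refs = {}
  end

  def track!(ref)
    @refs[ref] = true
  end

  def drain
    refs = @refs.keys
    @refs.clear # the ref is removed whether or not the read was fresh
    refs
  end

  def empty?
    @refs.empty?
  end
end

# Stand-in for the bulk worker: read each tracked row from the replica and index it.
def bulk_index(tracker, replica, index)
  tracker.drain.each { |id| index[id] = replica.find(id) }
end

primary = { 1 => { milestone_id: 10 } }
replica = Replica.new(primary)
tracker = Tracker.new
index   = { 1 => { milestone_id: 10 } }

# Steps 1 and 2: the write commits on the primary, then the ref is tracked.
primary[1] = primary[1].merge(milestone_id: 20)
replica.replicate(1, { milestone_id: 20 })
tracker.track!(1)

# Steps 3 to 5: the cron tick fires before the replica has caught up.
bulk_index(tracker, replica, index)
replica.catch_up!

puts "indexed milestone_id: #{index[1][:milestone_id]}"
puts "replica milestone_id: #{replica.find(1)[:milestone_id]}"
puts "queue empty:          #{tracker.empty?}"
```

In this model the index keeps the old milestone_id even after the replica catches up, because the queue is already empty and nothing triggers a second read.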

What is the current bug behavior?

Elasticsearch shows the pre-change milestone_id until the next write to the record. In the observed case it stayed stale for ~1h 53m until subsequent milestone changes re-enqueued it.

What is the expected correct behavior?

The indexer either reads the post-commit row, or detects that it read pre-commit and retries.

Relevant logs and/or screenshots

Three indexing events for database_id 157334722. search_indexing_duration_s is Time.current - record.updated_at at index time:

| Time | Event | `search_indexing_duration_s` |
| --- | --- | --- |
| 2026-05-15T09:58:49.452 | `track_items` enqueue (`WorkItem\|157334722\|group_9970`) | - |
| 2026-05-15T09:58:51.301 | `indexing_done` | 254293 (~2.94 days) |
| 2026-05-15T11:51:35.473 | `track_items` enqueue | - |
| 2026-05-15T11:51:36.281 | `indexing_done` | 0 |
| 2026-05-15T11:51:41.983 | `track_items` enqueue | - |
| 2026-05-15T11:51:43.592 | `indexing_done` | 1 |

The 09:58:51 read returned a record whose updated_at predated the milestone change by ~3 days (the row's previous actual update). The gap between enqueue and indexing_done was ~1.85s, so the DB read ran at most ~1.85s after the enqueue, and at that point the replica had still not applied the milestone UPDATE.
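
The figures quoted here can be re-derived from the log timestamps. The snippet below is only an arithmetic check; it assumes the logged timestamps are UTC, which the logs as quoted do not state.

```ruby
require 'time'

# Timestamps copied from the log table above; treating them as UTC is an assumption.
enqueued_at  = Time.iso8601('2026-05-15T09:58:49.452Z')
indexed_at   = Time.iso8601('2026-05-15T09:58:51.301Z')
reindexed_at = Time.iso8601('2026-05-15T11:51:36.281Z')
duration_s   = 254_293 # search_indexing_duration_s of the first indexing_done

# Gap between the first enqueue and the first indexing_done event.
enqueue_to_done_s = (indexed_at - enqueued_at).round(2)

# Age of the row that was read, expressed in days.
age_days = (duration_s / 86_400.0).round(2)

# updated_at implied by duration = Time.current - record.updated_at.
implied_updated_at = indexed_at - duration_s

# How long the index held the stale value before the next indexing_done.
stale_minutes = ((reindexed_at - indexed_at) / 60).round

puts "enqueue to indexing_done: #{enqueue_to_done_s}s"
puts "row age at index time:    #{age_days} days"
puts "implied updated_at date:  #{implied_updated_at.utc.strftime('%Y-%m-%d')}"
puts "index stale for about:    #{stale_minutes} min"
```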

Possible fixes

ElasticIndexBulkCronWorker declares data_consistency :sticky, but :sticky only protects against replica lag when the job is enqueued in a session that just performed the write. Here the chain is:

write happens → after_commit → Redis ZADD (no Sidekiq enqueue)
... cron tick → schedule_shards → shard worker → DB read

The Redis ZSET does not carry the write LSN. The cron-driven shard worker is enqueued from a session with no writes, so :sticky has no LSN to wait for and degrades to replica-only with no catch-up requirement. Carrying the LSN forward through the ZSET is not practical: the ZSET member is a fixed klass|id|routing string and is what makes dedup work.
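
A minimal sketch of why a missing LSN matters, under stated assumptions: this is a simplified model and not GitLab's load-balancing code. `Job`, `read_target`, and the integer LSNs are hypothetical, and the fallback to the primary when a replica is behind reflects our understanding of :sticky rather than anything shown in this issue.

```ruby
# Hypothetical model of the read-target decision; not GitLab's implementation.
# LSNs are plain integers here for simplicity.
Job = Struct.new(:data_consistency, :wal_location, keyword_init: true)

def read_target(job, replica_lsn)
  case job.data_consistency
  when :always
    :primary
  when :sticky
    # No write LSN was attached at enqueue time, so there is nothing to compare against.
    return :replica if job.wal_location.nil?

    # With an LSN, the replica is used only if it has replayed at least that far.
    replica_lsn >= job.wal_location ? :replica : :primary
  else
    raise ArgumentError, "unknown data_consistency: #{job.data_consistency.inspect}"
  end
end

replica_lsn = 100 # the replica has replayed up to here; the write landed at 120

cron_job   = Job.new(data_consistency: :sticky, wal_location: nil)
write_job  = Job.new(data_consistency: :sticky, wal_location: 120)
always_job = Job.new(data_consistency: :always, wal_location: nil)

puts read_target(cron_job, replica_lsn)   # cron-enqueued job, no LSN
puts read_target(write_job, replica_lsn)  # job enqueued by the writing session
puts read_target(always_job, replica_lsn)
```

In this model only the cron-enqueued job reads from the lagging replica, which matches the chain described above: the job that performs the read never sees the LSN of the write that caused it.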

The two viable approaches:

1. Change data_consistency to :always on ElasticIndexBulkCronWorker (and the initial variant). Smallest patch, deterministically eliminates the race. The cost is that all bulk indexer DB reads go to the primary; batches are small per shard, so the load impact should be measurable but bounded.
2. Detect and re-enqueue stale reads. After preload, compare record.updated_at against a freshness threshold; if a record is suspiciously old for a ref that should reflect a recent write, log a warning and re-track! it. Two complications need solving before this can ship:
   - Initial indexing: ProcessInitialBookkeepingService backfills records that may legitimately have updated_at from years ago. A blanket age-based check would false-positive on every initial-indexed record. Needs a way to opt out for the initial path (likely an override on the subclass).
   - Paused indexing: when elasticsearch_pause_indexing? is on, track! keeps writing to Redis but the cron worker no-ops. When indexing resumes, the queue contains refs whose triggering writes may be hours old. Every read after resume would look stale to a naive freshness check. Needs the check to be aware of (or suppressed during) a post-resume window.
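
A rough sketch of what the second approach might look like, including the two opt-outs. Everything here is hypothetical: `StaleReadCheck`, `refs_to_retrack`, the 60-second threshold, and the 300-second post-resume grace window are illustrative assumptions, not existing GitLab code or settings.

```ruby
# Hypothetical sketch of the detect-and-re-enqueue approach; none of these names exist in GitLab.
class StaleReadCheck
  # threshold_s:          maximum plausible age of updated_at for a freshly tracked ref
  # enabled:              false on the initial-indexing path, where old rows are expected
  # resumed_at / grace_s: suppress the check for a window after indexing resumes
  def initialize(threshold_s:, enabled: true, resumed_at: nil, grace_s: 0)
    @threshold_s = threshold_s
    @enabled = enabled
    @resumed_at = resumed_at
    @grace_s = grace_s
  end

  def stale?(updated_at:, now: Time.now)
    return false unless @enabled
    return false if @resumed_at && (now - @resumed_at) < @grace_s

    (now - updated_at) > @threshold_s
  end
end

# records maps ref => updated_at as read from the database.
# Returns the refs that should be logged and tracked again.
def refs_to_retrack(records, check, now:)
  records.select { |_ref, updated_at| check.stale?(updated_at: updated_at, now: now) }.keys
end

now = Time.utc(2026, 5, 15, 9, 58, 51)
records = {
  'WorkItem|1|group_1' => now - 254_293, # looks like a pre-commit read
  'WorkItem|2|group_1' => now - 1        # fresh read
}

regular = StaleReadCheck.new(threshold_s: 60)
initial = StaleReadCheck.new(threshold_s: 60, enabled: false)
resumed = StaleReadCheck.new(threshold_s: 60, resumed_at: now - 30, grace_s: 300)

p refs_to_retrack(records, regular, now: now)
p refs_to_retrack(records, initial, now: now)
p refs_to_retrack(records, resumed, now: now)
```

The regular check flags only the old row, while the initial-indexing and post-resume variants flag nothing. Choosing the threshold and the grace window is the hard part and is not addressed by this sketch.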

Output of checks

This bug happens on GitLab.com.