Zoekt: Newly created indices are instantly evicted
Summary
Right now we're seeing newly created indices being evicted immediately because they hit the critical watermark level:
state: "pending_eviction",
reserved_storage_bytes: 1024,
used_storage_bytes: 1024,
watermark_level: "critical_watermark_exceeded",
There are two scenarios: empty namespaces and non-empty namespaces.
For the empty namespaces
For an empty namespace, required_storage_bytes from the plan is 0. When that value is 0, the ProvisioningService sets reserved_storage_bytes to 1 kilobyte: https://gitlab.com/gitlab-org/gitlab/-/blob/bddaa8ee11f219638ed7d701351c914b73f31e48/ee/app/services/search/zoekt/provisioning_service.rb#L81
So, an index will be created with these attributes:
reserved_storage_bytes: 1024,
used_storage_bytes: 0
The UpdateIndexUsedStorageBytesEventWorker then calls update_storage_bytes!, which updates used_storage_bytes to DEFAULT_USED_STORAGE_BYTES (1.kilobyte) but skips the call to refresh_reserved_storage_bytes because its if condition is not met. Both used_storage_bytes and reserved_storage_bytes are now 1.kilobyte, so storage usage is 100%. The before_save callback therefore sets watermark_level to critical_watermark_exceeded, and the index is evicted.
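The sequence above can be reproduced with a small standalone sketch. This is not the real model code: the watermark_level helper and the 0.9 threshold are assumptions made for illustration, and only the 1024-byte values come from the description above.

```ruby
# Standalone illustration; NOT the real Zoekt index model.
CRITICAL_WATERMARK = 0.9          # assumed threshold, for illustration only
DEFAULT_USED_STORAGE_BYTES = 1024 # 1.kilobyte, as described above

# Hypothetical helper mimicking what the before_save callback decides.
def watermark_level(used_storage_bytes, reserved_storage_bytes)
  return :healthy if reserved_storage_bytes.zero?

  percent_used = used_storage_bytes.to_f / reserved_storage_bytes
  percent_used >= CRITICAL_WATERMARK ? :critical_watermark_exceeded : :healthy
end

# Index as created by provisioning for an empty namespace.
reserved_storage_bytes = 1024
used_storage_bytes = 0
level_at_creation = watermark_level(used_storage_bytes, reserved_storage_bytes)

# The worker bumps used_storage_bytes to the default,
# but reserved_storage_bytes is not refreshed.
used_storage_bytes = DEFAULT_USED_STORAGE_BYTES
level_after_worker = watermark_level(used_storage_bytes, reserved_storage_bytes)

puts "at creation:  #{level_at_creation}"
puts "after worker: #{level_after_worker}"
```

With used and reserved both at 1024 bytes the ratio is 1.0, so any critical threshold at or below 100% is exceeded.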
For non-empty namespaces
This is more of an edge case.
The index is created with a nonzero ideal reserved_storage_bytes and with used_storage_bytes set to DEFAULT_USED_STORAGE_BYTES (1.kilobyte). UpdateIndexUsedStorageBytesEventWorker then starts updating used_storage_bytes by summing size_bytes from zoekt_repositories. The edge case occurs when large new zoekt_repositories are added, or the size of existing zoekt_repositories changes, before the index becomes ready. In that case used_storage_bytes keeps increasing while the update of reserved_storage_bytes is skipped because of this condition. In the before_save callback, storage_percent_used can then reach the point where the index is marked critical_watermark_exceeded, so the index is evicted before it ever moves to ready.
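A rough numeric illustration of this edge case is sketched below. The repository sizes, the reserved value, and the 0.9 threshold are made-up numbers, not taken from a real index.

```ruby
# Standalone illustration with made-up numbers; NOT the real model code.
CRITICAL_WATERMARK = 0.9 # assumed threshold, for illustration only

# reserved_storage_bytes is fixed at creation and never refreshed
# while the index is not yet ready.
reserved_storage_bytes = 10_000_000

# Hypothetical size_bytes values of the zoekt_repositories in the index.
repository_size_bytes = [2_000_000, 3_000_000]

# used_storage_bytes is the sum of size_bytes, compared to the frozen reservation.
storage_percent_used = ->(sizes) { sizes.sum.to_f / reserved_storage_bytes }

before_growth = storage_percent_used.call(repository_size_bytes)

# A large repository is added before the index becomes ready.
repository_size_bytes << 4_500_000
after_growth = storage_percent_used.call(repository_size_bytes)

puts "before: #{before_growth}, critical? #{before_growth >= CRITICAL_WATERMARK}"
puts "after:  #{after_growth}, critical? #{after_growth >= CRITICAL_WATERMARK}"
```

Because the reservation cannot grow along with the data, a single large repository is enough to push the ratio past the threshold.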
Steps to reproduce
What is the current bug behavior?
What is the expected correct behavior?
Relevant logs and/or screenshots
Possible fixes
- Move the watermark_level setting from the after_save callback to update_storage_bytes, after the calls to refresh_used_storage_bytes and refresh_reserved_storage_bytes. This ensures that used_storage_bytes and reserved_storage_bytes are up to date before the watermark level is set.
- Inside refresh_reserved_storage_bytes, don't allow the reduction of reserved_storage_bytes if the index is not ready.
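Both ideas could look roughly like the sketch below. It is a plain-Ruby stand-in, not the actual ActiveRecord model: the IndexSketch class, the HEADROOM factor, the 0.9 threshold, and the way the ideal reservation is computed are all assumptions made for illustration.

```ruby
# Hypothetical plain-Ruby stand-in for the index model; names and numbers are assumptions.
class IndexSketch
  CRITICAL_WATERMARK = 0.9          # assumed threshold
  HEADROOM = 2.0                    # assumed: ideal reservation is twice the usage
  DEFAULT_USED_STORAGE_BYTES = 1024 # 1.kilobyte

  attr_reader :state, :used_storage_bytes, :reserved_storage_bytes, :watermark_level

  def initialize(state:, used_storage_bytes:, reserved_storage_bytes:)
    @state = state
    @used_storage_bytes = used_storage_bytes
    @reserved_storage_bytes = reserved_storage_bytes
    @watermark_level = :healthy
  end

  # Fix 1: compute the watermark only after both values are refreshed.
  def update_storage_bytes!(repository_size_bytes)
    refresh_used_storage_bytes(repository_size_bytes)
    refresh_reserved_storage_bytes
    set_watermark_level
  end

  private

  def refresh_used_storage_bytes(repository_size_bytes)
    total = repository_size_bytes.sum
    @used_storage_bytes = total.zero? ? DEFAULT_USED_STORAGE_BYTES : total
  end

  # Fix 2: the reservation may always grow, but may shrink only when ready.
  def refresh_reserved_storage_bytes
    ideal = (@used_storage_bytes * HEADROOM).ceil
    return if ideal < @reserved_storage_bytes && @state != :ready

    @reserved_storage_bytes = ideal
  end

  def set_watermark_level
    percent_used = @used_storage_bytes.to_f / @reserved_storage_bytes
    @watermark_level =
      percent_used >= CRITICAL_WATERMARK ? :critical_watermark_exceeded : :healthy
  end
end

# Empty namespace: the index no longer turns critical right after creation.
index = IndexSketch.new(state: :pending, used_storage_bytes: 0, reserved_storage_bytes: 1024)
index.update_storage_bytes!([])
puts index.watermark_level
```

In this sketch a not-yet-ready index keeps its original reservation when usage is small and gains more when usage grows, so the watermark is always computed against a current reservation. A real fix would also need to respect the storage available on the node, which is ignored here.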