Zoekt: Newly created indices are instantly evicted

Summary

Right now we're seeing newly created indices being evicted immediately because they hit the critical watermark level:

 state: "pending_eviction",
 reserved_storage_bytes: 1024,
 used_storage_bytes: 1024,
 watermark_level: "critical_watermark_exceeded",

There are two scenarios: empty namespaces and non-empty namespaces.

For empty namespaces

This happens because required_storage_bytes from the plan is 0. When the required bytes are 0, the ProvisioningService sets reserved_storage_bytes to 1 kilobyte: https://gitlab.com/gitlab-org/gitlab/-/blob/bddaa8ee11f219638ed7d701351c914b73f31e48/ee/app/services/search/zoekt/provisioning_service.rb#L81

So, an index will be created with these attributes:

 reserved_storage_bytes: 1024,
 used_storage_bytes: 0

The UpdateIndexUsedStorageBytesEventWorker then calls update_storage_bytes!, which sets used_storage_bytes to DEFAULT_USED_STORAGE_BYTES (1.kilobyte) but skips the call to refresh_reserved_storage_bytes because the if condition fails. Both used_storage_bytes and reserved_storage_bytes are now 1.kilobyte, so the before_save callback sets watermark_level to critical_watermark_exceeded and the index is evicted.
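
The sequence above can be reproduced with a small standalone sketch. This is not the real model: SketchIndex, save_sketch and ASSUMED_CRITICAL_RATIO are hypothetical names, and the 0.85 threshold is an assumption, since the actual watermark thresholds are not stated in this issue.

```ruby
KILOBYTE = 1024
DEFAULT_USED_STORAGE_BYTES = KILOBYTE
# Assumption: the real critical threshold is not given in this issue.
ASSUMED_CRITICAL_RATIO = 0.85

# Plain Struct standing in for the index record (hypothetical, no database).
SketchIndex = Struct.new(:reserved_storage_bytes, :used_storage_bytes, :watermark_level) do
  def storage_percent_used
    return 0.0 if reserved_storage_bytes.to_i.zero?

    used_storage_bytes.to_f / reserved_storage_bytes
  end

  # Mimics the save-time callback that derives watermark_level.
  def save_sketch
    self.watermark_level =
      storage_percent_used >= ASSUMED_CRITICAL_RATIO ? :critical_watermark_exceeded : :healthy
    self
  end
end

# Provisioning: the plan asks for 0 bytes, so 1 kilobyte is reserved instead.
required_storage_bytes = 0
reserved = required_storage_bytes.zero? ? KILOBYTE : required_storage_bytes
index = SketchIndex.new(reserved, 0, :healthy).save_sketch

# Worker run: used bytes fall back to the default, reserved bytes are not refreshed.
index.used_storage_bytes = DEFAULT_USED_STORAGE_BYTES
index.save_sketch

puts index.watermark_level # prints critical_watermark_exceeded
```

Because used and reserved are both 1024, the ratio is exactly 1.0, so any critical threshold at or below 100% would trigger the eviction in this sketch.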

For non-empty namespaces

This is more of an edge case. The index is created with a nonzero ideal reserved_storage_bytes and with used_storage_bytes set to DEFAULT_USED_STORAGE_BYTES (1.kilobyte). UpdateIndexUsedStorageBytesEventWorker then starts updating used_storage_bytes by summing size_bytes from zoekt_repositories. The edge case occurs when large new zoekt_repositories are added, or the sizes of existing zoekt_repositories change, before the index becomes ready. In that case used_storage_bytes keeps increasing while the update of reserved_storage_bytes is skipped because of this condition. In the before_save callback, storage_percent_used can then push the index to critical_watermark_exceeded, evicting it before it ever moves to ready.
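
A rough numeric illustration of this edge case follows. All sizes are made up, sketch_percent_used is a hypothetical helper, and the 0.85 threshold is an assumption rather than a value taken from the codebase. The skipped reserved_storage_bytes update is modelled simply by never changing that variable.

```ruby
MEGABYTE = 1024 * 1024
# Assumption: the real critical threshold is not given in this issue.
ASSUMED_CRITICAL_THRESHOLD = 0.85

# Hypothetical helper: used/reserved as a fraction.
def sketch_percent_used(used_bytes, reserved_bytes)
  used_bytes.to_f / reserved_bytes
end

reserved_storage_bytes = 10 * MEGABYTE          # made-up "ideal" reservation
repository_sizes = [2 * MEGABYTE, 3 * MEGABYTE] # made-up size_bytes values

levels = []
# Each element stands for one worker run; a 4 MB repository arrives before the second run.
[[], [4 * MEGABYTE]].each do |new_repositories|
  repository_sizes.concat(new_repositories)
  used_storage_bytes = repository_sizes.sum
  # reserved_storage_bytes is deliberately left untouched, mirroring the skipped refresh.
  ratio = sketch_percent_used(used_storage_bytes, reserved_storage_bytes)
  levels << (ratio >= ASSUMED_CRITICAL_THRESHOLD ? :critical_watermark_exceeded : :healthy)
end

p levels # prints [:healthy, :critical_watermark_exceeded]
```

The first run sits at 50% of the reservation; the second reaches 90% with the reservation unchanged, which crosses the assumed threshold.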

Steps to reproduce

What is the current bug behavior?

What is the expected correct behavior?

Relevant logs and/or screenshots

Possible fixes

  • Move the watermark_level setting from the before_save callback into update_storage_bytes, after the calls to refresh_used_storage_bytes and refresh_reserved_storage_bytes. This ensures that used_storage_bytes and reserved_storage_bytes are up to date before the watermark level is set.
  • Inside refresh_reserved_storage_bytes, don't allow reserved_storage_bytes to be reduced while the index is not ready.
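
Both proposed fixes could look roughly like the sketch below. It is a plain Ruby class, not the actual index model: FixedSketchIndex, the measured_bytes and ideal_reserved_bytes parameters, refresh_watermark_level, and the 0.85 threshold are all assumptions introduced for illustration, and the values are passed in so that no database is needed.

```ruby
class FixedSketchIndex
  # Assumption: the real critical threshold is not given in this issue.
  ASSUMED_CRITICAL_RATIO = 0.85

  attr_reader :reserved_storage_bytes, :used_storage_bytes, :watermark_level

  def initialize(reserved_storage_bytes:, ready: false)
    @reserved_storage_bytes = reserved_storage_bytes
    @used_storage_bytes = 0
    @ready = ready
    @watermark_level = :healthy
  end

  # measured_bytes and ideal_reserved_bytes are hypothetical inputs for this sketch.
  def update_storage_bytes!(measured_bytes:, ideal_reserved_bytes:)
    @used_storage_bytes = measured_bytes
    refresh_reserved_storage_bytes(ideal_reserved_bytes)
    # Fix 1: derive the level only after both values are current.
    refresh_watermark_level
  end

  private

  def refresh_reserved_storage_bytes(ideal_reserved_bytes)
    # Fix 2: an index that is not ready may grow its reservation but never shrink it.
    return if !@ready && ideal_reserved_bytes < @reserved_storage_bytes

    @reserved_storage_bytes = ideal_reserved_bytes
  end

  def refresh_watermark_level
    ratio = @used_storage_bytes.to_f / @reserved_storage_bytes
    @watermark_level = ratio >= ASSUMED_CRITICAL_RATIO ? :critical_watermark_exceeded : :healthy
  end
end

index = FixedSketchIndex.new(reserved_storage_bytes: 1024)
index.update_storage_bytes!(measured_bytes: 1024, ideal_reserved_bytes: 4096)
puts index.watermark_level # prints healthy
```

With the reservation refreshed before the level is computed, the 1 kilobyte of used storage is compared against the new reservation, so the index in this sketch stays healthy.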
Edited by Ravi Kumar