Zoekt Newly created indices are instantly evicted

What does this MR do and why?

The problem we trying to solve is that the newly created zoekt_index could get evicted without trying to bump the reserved_storage_bytes. The following are the two scenarios:

For the empty namespaces

This is happening here, required_storage_bytes is 0 from plan. In the ProvisioningService we are setting the reserved_storage_bytes to 1 kilobyte if the required_bytes = 0 https://gitlab.com/gitlab-org/gitlab/-/blob/bddaa8ee11f219638ed7d701351c914b73f31e48/ee/app/services/search/zoekt/provisioning_service.rb#L81

So, an index will be created with these attributes:

 reserved_storage_bytes: 1024,
 used_storage_bytes: 0

Now this worker UpdateIndexUsedStorageBytesEventWorker will call update_storage_bytes! which will update used_storage_bytes to DEFAULT_USED_STORAGE_BYTES(1.kilobyte) but it will skip to call refresh_reserved_storage_bytes because if condition will fail. Now both used_storage_bytes and reserved_storage_bytes will become 1.kilobyte. Therefore watermark_level will be set to critical_watermark_exceeded in before_save callback. Thus evicting this index.

For non-empty namespaces

This is more like an edge case. The index gets created with a non-zero ideal reserved_storage_bytes and used_storage_bytes with DEFAULT_USED_STORAGE_BYTES(1.kilobyte). Now UpdateIndexUsedStorageBytesEventWorker will start updating used_storage_bytes by summing the size_bytes from zoekt_repositories. Now the edge case is, if new big zoekt_repositories got added or existing zoekt_repositories size got changed before the index gets ready. In this case, we are continuously increasing the used_storage_bytes but skipping the update of the reserved_storage_bytes because of this condition. The reserved_storage_bytes is already set according to the namespace size at the time of rollout. But during the rollout, the namespace got increased. In the before_save callback there is a chance that the storage_percent_used will make the index critical_watermark_exceeded. Thus evicting the index even before moving to ready.

The fix is to

  • Move the watermark_level setting from after_save callback to update_storage_bytes after the call of refresh_used_storage_bytes and refresh_reserved_storage_bytes. This ensures that used_storage_bytes and reserved_storage_bytes are up-to-date before setting the watermark level.

  • Inside the refresh_reserved_storage_bytes, don't allow the reduction of reserved_storage_bytes if the index is not ready but allows the expansion.

  • Other changes I have made in this MR are to move some public methods to private.

  • Refactor the specs

  • Rename update_storage_bytes to update_storage_bytes_and_watermark_level!

  • Remove the DEFAULT_RESERVED_STORAGE_BYTES. It is not needed. This constant was used whenever reserved_storage_bytes was 0. With the new logic, reserved_storage_bytes can only be 0, when an index is created for an empty namespace with 0 reserved_storage_bytes. UpdateIndexUsedStorageBytesEventWorker worker will update the used_storage_bytes to 1.kilobyte. refresh_reserved_storage_bytes will also try to bump the reserved_storage_bytes but the node has 0 unclaimed_storage_bytes. In this case reserved_storage_bytes will remain 0. But I think this is fine. In the next step watermark_level will be set to critical_watermark_level and eventually will be evicted.

References

Screenshots or screen recordings

Before After

How to set up and validate locally

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #521972 (closed)

Edited by Ravi Kumar

Merge request reports

Loading