fix: stabilize branch metadata verification and protect branch heads during cleanup

Fix: prevent main branch tag loss during metadata verification

Problem

The main branch tag (dle:branch=main) intermittently disappears from all snapshots, making branches invisible and clone creation impossible. This has been observed multiple times over recent months, requiring manual intervention each time.

Root cause

VerifyBranchMetadata() in branching.go has three compounding issues:

  1. Unbounded accumulation of dle:child entries. The function calls SetRelation → addChild on every run without first clearing old dle:child/dle:parent values. When an old snapshot survives (e.g., pinned by a user-created clone), it accumulates child entries from every scheduled snapshot cycle. Over weeks, the dle:child property grows to 100+ entries.

  2. Delete-before-add pattern for dle:branch. The function first removes dle:branch=main from all snapshots, then re-adds it to the newest. When ZFS commands fail (due to the bloated properties from issue 1), the re-add never executes, leaving main absent from all snapshots.

  3. No recovery path for scheduled jobs. VerifyBranchMetadata errors were logged as warnings. Scheduled snapshot runs did not call InitBranching as a fallback, so the lost tag was never recovered automatically.

Changes

engine/internal/provision/thinclones/zfs/branching.go

  • Reset dle:parent and dle:child on all snapshots before rebuilding relationships, preventing unbounded accumulation
  • Reverse the dle:branch update order: add to new head first, then remove from non-heads, so the tag is never absent from all snapshots
  • Parse comma-separated branch values properly via splitBranches() instead of treating them as a single string
  • Downgrade non-head branch cleanup errors to warnings (the critical add-to-head step is already done)
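The add-first ordering and the comma-separated parsing can be sketched as follows. This is a minimal illustration, not the code in branching.go: the snapshot properties are modeled as an in-memory map instead of ZFS calls, and moveBranchHead, withBranch, and withoutBranch are hypothetical names. The splitBranches here is an assumed shape for the helper named above; skipping "-" assumes the value was read from zfs output, where "-" stands for an unset property.

```go
package main

import "strings"

// splitBranches parses a comma-separated dle:branch value into branch names.
// Empty entries and "-" (printed by zfs for an unset property) are dropped.
// Assumed shape; the real helper may differ.
func splitBranches(value string) []string {
	var names []string
	for _, part := range strings.Split(value, ",") {
		if name := strings.TrimSpace(part); name != "" && name != "-" {
			names = append(names, name)
		}
	}
	return names
}

// withBranch returns value with branch appended unless it is already listed.
func withBranch(value, branch string) string {
	names := splitBranches(value)
	for _, name := range names {
		if name == branch {
			return strings.Join(names, ",")
		}
	}
	return strings.Join(append(names, branch), ",")
}

// withoutBranch returns value with every occurrence of branch removed.
func withoutBranch(value, branch string) string {
	var kept []string
	for _, name := range splitBranches(value) {
		if name != branch {
			kept = append(kept, name)
		}
	}
	return strings.Join(kept, ",")
}

// moveBranchHead updates props (snapshot name -> dle:branch value, a
// hypothetical in-memory stand-in for ZFS user properties).
func moveBranchHead(props map[string]string, branch, newHead string) {
	// Step 1: tag the new head first, so the branch is never absent.
	props[newHead] = withBranch(props[newHead], branch)

	// Step 2: strip the branch from every other snapshot. In the real code
	// a failure here would only be logged, since step 1 already succeeded.
	for snap, value := range props {
		if snap != newHead {
			props[snap] = withoutBranch(value, branch)
		}
	}
}
```

Comparing whole names after splitting, rather than matching on the raw string, keeps a branch such as mainline from being mistaken for main.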

engine/internal/provision/thinclones/zfs/zfs.go

  • Add getBranchHeadSnapshots() to identify pre-snapshots whose clone chains contain branch heads
  • Include these in the CleanupSnapshots exclusion list, preventing zfs destroy -R from cascading through branch heads
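One way to picture the exclusion logic is shown below. It is a sketch under assumptions, not the implementation in zfs.go: snapshotInfo, protectedPreSnapshots, and cleanupCandidates are hypothetical names, and the clone chain is passed in as plain data rather than discovered through ZFS.

```go
package main

// snapshotInfo is a hypothetical, simplified view of a pre-snapshot together
// with the snapshots that depend on it through clones.
type snapshotInfo struct {
	Name       string
	CloneChain []string // dependents that zfs destroy -R on Name would also remove
}

// protectedPreSnapshots returns the pre-snapshots whose clone chain contains
// at least one branch head.
func protectedPreSnapshots(snaps []snapshotInfo, heads map[string]bool) map[string]bool {
	protected := make(map[string]bool)
	for _, s := range snaps {
		for _, dep := range s.CloneChain {
			if heads[dep] {
				protected[s.Name] = true
				break
			}
		}
	}
	return protected
}

// cleanupCandidates returns the snapshots that remain eligible for
// destruction once the protected ones are excluded.
func cleanupCandidates(snaps []snapshotInfo, protected map[string]bool) []string {
	var candidates []string
	for _, s := range snaps {
		if !protected[s.Name] {
			candidates = append(candidates, s.Name)
		}
	}
	return candidates
}
```

The point of the exclusion is that zfs destroy -R removes all dependents of its target, so a branch head is only safe if every snapshot above it in the clone chain is kept out of the candidate list.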

engine/internal/retrieval/engine/postgres/snapshot/physical.go

  • Add schedulerMutex to serialize scheduled snapshot and cleanup jobs, preventing concurrent execution
  • Call InitBranching after each scheduled snapshot as a fallback recovery mechanism
  • Escalate VerifyBranchMetadata errors from warning to error level

Note on dual-tag window

The add-before-delete approach means two snapshots briefly have dle:branch=main between the add and delete steps. This is a deliberate tradeoff:

  • Zero tags (old behavior): branch becomes invisible, clones fail, manual intervention required
  • Two tags (new behavior): listBranches() returns a map[string]string, so each branch name maps to at most one snapshot and consumers always see exactly one head

An atomic swap is not possible since ZFS does not support transactional property changes across datasets.
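A small sketch of why the dual-tag window is invisible to consumers: when results are collected into a map keyed by branch name, a second snapshot carrying the same branch overwrites the first entry instead of adding a new one. The taggedSnapshot type and collapseHeads function are hypothetical, and which of the two snapshots wins in the real listBranches() depends on its iteration order, which this sketch does not claim to reproduce.

```go
package main

import "strings"

// taggedSnapshot is a hypothetical pair of snapshot name and its raw
// dle:branch property value.
type taggedSnapshot struct {
	Name     string
	Branches string
}

// collapseHeads builds a branch -> snapshot map. If two snapshots carry the
// same branch, the later one in the slice overwrites the earlier one.
func collapseHeads(snaps []taggedSnapshot) map[string]string {
	heads := make(map[string]string)
	for _, s := range snaps {
		for _, b := range strings.Split(s.Branches, ",") {
			if b = strings.TrimSpace(b); b != "" {
				heads[b] = s.Name
			}
		}
	}
	return heads
}
```

With two snapshots both tagged main, the resulting map still has a single main entry, so callers never see a duplicate head.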

Issue: #662 (closed)

Edited by Denis Morozov
