fix: stabilize branch metadata verification and protect branch heads during cleanup
Fix: prevent main branch tag loss during metadata verification
Problem
The main branch tag (dle:branch=main) intermittently disappears from all snapshots, making branches invisible and clone creation impossible. This has been observed multiple times over recent months, requiring manual intervention each time.
Root cause
VerifyBranchMetadata() in branching.go has three compounding issues:
1. Unbounded accumulation of dle:child entries. The function calls SetRelation → addChild on every run without first clearing old dle:child/dle:parent values. When an old snapshot survives (e.g., pinned by a user-created clone), it accumulates child entries from every scheduled snapshot cycle. Over weeks, the dle:child property grows to 100+ entries.
2. Delete-before-add pattern for dle:branch. The function first removes dle:branch=main from all snapshots, then re-adds it to the newest. When ZFS commands fail (due to the bloated properties from issue 1), the re-add never executes, leaving main absent from all snapshots.
3. No recovery path for scheduled jobs. VerifyBranchMetadata errors were logged as warnings. Scheduled snapshot runs did not call InitBranching as a fallback, so the lost tag was never recovered automatically.
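The accumulation described in the first issue can be illustrated with a simplified in-memory model. This is only a sketch: the property store is a plain map, the real addChild may behave differently, and rebuildRelations, simulateCycles, and the snapshot names are hypothetical helpers invented for the illustration.

```go
package main

import (
	"strconv"
	"strings"
)

// addChild appends a child to a comma-separated dle:child value.
// Assumption: a simplified stand-in for the real helper.
func addChild(current, child string) string {
	if current == "" {
		return child
	}
	return current + "," + child
}

// rebuildRelations records the live children of parent. With reset=false the
// old value is kept and extended (old behavior); with reset=true it is
// cleared first (new behavior).
func rebuildRelations(props map[string]string, parent string, liveChildren []string, reset bool) {
	if reset {
		delete(props, parent)
	}
	for _, child := range liveChildren {
		props[parent] = addChild(props[parent], child)
	}
}

// simulateCycles runs the given number of scheduled snapshot cycles against
// one pinned parent snapshot and returns how many dle:child entries it holds
// afterwards. Each cycle has exactly one live child in this model.
func simulateCycles(cycles int, reset bool) int {
	props := map[string]string{}
	const parent = "pool@pinned" // hypothetical snapshot name
	for i := 1; i <= cycles; i++ {
		live := []string{"pool@snapshot_" + strconv.Itoa(i)}
		rebuildRelations(props, parent, live, reset)
	}
	if props[parent] == "" {
		return 0
	}
	return len(strings.Split(props[parent], ","))
}
```

In this model, simulateCycles(100, false) leaves 100 entries on the pinned snapshot, while simulateCycles(100, true) leaves only the one live child.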
Changes
engine/internal/provision/thinclones/zfs/branching.go
- Reset dle:parent and dle:child on all snapshots before rebuilding relationships, preventing unbounded accumulation
- Reverse the dle:branch update order: add to new head first, then remove from non-heads, so the tag is never absent from all snapshots
- Parse comma-separated branch values properly via splitBranches() instead of treating them as a single string
- Downgrade non-head branch cleanup errors to warnings (the critical add-to-head step is already done)
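The add-first ordering and the comma-separated parsing can be sketched as follows. This is not the code from branching.go: memStore is a hypothetical in-memory stand-in for the ZFS property calls, moveBranchHead is an invented name, and treating "-" as an unset value is an assumption about how the property is read.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"strings"
)

// splitBranches parses a comma-separated dle:branch value into branch names.
// Assumption: "" and "-" both mean the property is unset.
func splitBranches(value string) []string {
	if value == "" || value == "-" {
		return nil
	}
	var branches []string
	for _, b := range strings.Split(value, ",") {
		if b = strings.TrimSpace(b); b != "" {
			branches = append(branches, b)
		}
	}
	return branches
}

// memStore is a hypothetical in-memory replacement for ZFS property updates.
type memStore struct {
	tags       map[string]map[string]bool // snapshot -> set of branch names
	failRemove bool                       // simulate failing cleanup commands
}

func (m *memStore) addBranch(snapshot, branch string) error {
	if m.tags[snapshot] == nil {
		m.tags[snapshot] = map[string]bool{}
	}
	m.tags[snapshot][branch] = true
	return nil
}

func (m *memStore) removeBranch(snapshot, branch string) error {
	if m.failRemove {
		return errors.New("simulated zfs failure")
	}
	delete(m.tags[snapshot], branch)
	return nil
}

// moveBranchHead tags newHead first and only then removes the branch from
// other snapshots. current maps snapshot name to its raw dle:branch value.
func moveBranchHead(m *memStore, branch, newHead string, current map[string]string) error {
	// Critical step: a failure here is returned to the caller.
	if err := m.addBranch(newHead, branch); err != nil {
		return fmt.Errorf("tag new head %s: %w", newHead, err)
	}
	// Best-effort step: failures are logged as warnings only.
	for snap, value := range current {
		if snap == newHead {
			continue
		}
		for _, b := range splitBranches(value) {
			if b != branch {
				continue
			}
			if err := m.removeBranch(snap, branch); err != nil {
				log.Printf("warning: could not remove %s from %s: %v", branch, snap, err)
			}
		}
	}
	return nil
}
```

Even when every removal fails, the new head already carries the branch, so the worst case in this sketch is a stale extra tag rather than a missing one.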
engine/internal/provision/thinclones/zfs/zfs.go
- Add getBranchHeadSnapshots() to identify pre-snapshots whose clone chains contain branch heads
- Include these in the CleanupSnapshots exclusion list, preventing zfs destroy -R from cascading through branch heads
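One way to picture the exclusion logic is a recursive walk over the clone chain of each pre-snapshot. The sketch below uses an in-memory tree instead of real ZFS queries. The snapshot and dataset types and the function names are hypothetical, and whether a tag on the pre-snapshot itself should count is an assumption made here.

```go
package main

// snapshot is a hypothetical in-memory view of a ZFS snapshot.
type snapshot struct {
	Name   string
	Branch string     // raw dle:branch value, empty when unset
	Clones []*dataset // datasets cloned from this snapshot
}

// dataset is a hypothetical in-memory view of a clone and its snapshots.
type dataset struct {
	Name      string
	Snapshots []*snapshot
}

// containsBranchHead reports whether s, or any snapshot reachable through
// its clones, carries a branch tag.
func containsBranchHead(s *snapshot) bool {
	if s.Branch != "" {
		return true
	}
	for _, clone := range s.Clones {
		for _, child := range clone.Snapshots {
			if containsBranchHead(child) {
				return true
			}
		}
	}
	return false
}

// branchHeadSnapshots returns the names of pre-snapshots that must be
// excluded from cleanup because destroying them recursively would also
// destroy a branch head.
func branchHeadSnapshots(preSnapshots []*snapshot) []string {
	var protected []string
	for _, s := range preSnapshots {
		if containsBranchHead(s) {
			protected = append(protected, s.Name)
		}
	}
	return protected
}
```

The returned names would then be merged into the cleanup exclusion list before any destroy command runs.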
engine/internal/retrieval/engine/postgres/snapshot/physical.go
- Add schedulerMutex to serialize scheduled snapshot and cleanup jobs, preventing concurrent execution
- Call InitBranching after each scheduled snapshot as a fallback recovery mechanism
- Escalate VerifyBranchMetadata errors from warning to error level
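The serialization and fallback steps can be sketched with a single mutex shared by both job types. The scheduledJobs struct and its callback fields are hypothetical; only the schedulerMutex name and the idea of calling InitBranching after a snapshot come from the description above.

```go
package main

import (
	"log"
	"sync"
)

// scheduledJobs is a hypothetical holder for the scheduled job callbacks.
type scheduledJobs struct {
	schedulerMutex sync.Mutex   // shared by snapshot and cleanup jobs
	takeSnapshot   func() error // stand-in for the scheduled snapshot
	initBranching  func() error // stand-in for InitBranching
	cleanup        func() error // stand-in for snapshot cleanup
}

// runSnapshot takes a snapshot and then re-initializes branching as a
// recovery step, all while holding the scheduler mutex.
func (j *scheduledJobs) runSnapshot() error {
	j.schedulerMutex.Lock()
	defer j.schedulerMutex.Unlock()

	if err := j.takeSnapshot(); err != nil {
		return err
	}
	if err := j.initBranching(); err != nil {
		// Reported at error level rather than as a warning.
		log.Printf("error: branching recovery failed: %v", err)
		return err
	}
	return nil
}

// runCleanup holds the same mutex, so it never overlaps with runSnapshot.
func (j *scheduledJobs) runCleanup() error {
	j.schedulerMutex.Lock()
	defer j.schedulerMutex.Unlock()
	return j.cleanup()
}
```

Because both methods lock the same mutex, a cleanup run that fires while a snapshot is in progress simply waits instead of racing with the metadata update.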
Note on dual-tag window
The add-before-delete approach means two snapshots briefly have dle:branch=main between the add and delete steps. This is a deliberate tradeoff:
- Zero tags (old behavior): branch becomes invisible, clones fail, manual intervention required
- Two tags (new behavior): listBranches() returns map[string]string, so only one snapshot per branch name is returned regardless; consumers always see exactly one head
An atomic swap is not possible since ZFS does not support transactional property changes across datasets.
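Why the dual-tag window stays invisible to consumers can be shown with a small sketch of building a branch-to-snapshot map. The snapshotProp type and branchHeads function are hypothetical, and the sketch assumes entries are processed in order so that a later entry overwrites an earlier one; the real listBranches() may resolve duplicates differently.

```go
package main

import "strings"

// snapshotProp pairs a snapshot with its raw dle:branch value (hypothetical).
type snapshotProp struct {
	Snapshot string
	Branches string
}

// branchHeads collapses the per-snapshot values into one snapshot per branch.
// A map key can hold only one value, so a branch tagged on two snapshots
// still yields a single entry.
func branchHeads(props []snapshotProp) map[string]string {
	heads := make(map[string]string)
	for _, p := range props {
		for _, b := range strings.Split(p.Branches, ",") {
			if b = strings.TrimSpace(b); b != "" && b != "-" {
				heads[b] = p.Snapshot
			}
		}
	}
	return heads
}
```

With both an old and a new snapshot tagged main, the resulting map in this sketch still has exactly one main entry.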
Issue: #662 (closed)