Slow dblab clone list when there are many clones

dblab clone list on dblab-ci instance takes 4 something seconds to process 113 clones (some clones are real (like 20), and the rest are leftovers that were not cleaned up properly).

That leads to the web ui giving 500 errors regularly:

Error
{"hint":"","details":"HTTPConnectionPool(host='platform-chisel.production.svc.cluster.local', port=80): Read timed out. (read timeout=10) Full response: {\"error\": {\"code\": 500, \"detail\": \"HTTPConnectionPool(host='platform-chisel.production.svc.cluster.local', port=80): Read timed out. (read timeout=10)\", \"message\": \"Internal Server Error\"}, \"content\": null, \"message\": \"Internal Server Error\", \"status_code\": 500}"}

Looks like there is n+1 queryperformance problem in the code.

Analysis from Cursor below:

Analysis of `dblab clone list` Performance Issue

After analyzing the codebase, I've identified the root cause of the 4-second delay in dblab clone list. Here's my analysis:

Root Cause: N+1 Query Problem with ZFS Commands

The performance bottleneck occurs in the GetClones() method in engine/internal/cloning/base.go:

func (c *Base) GetClones() []*models.Clone {
    // ... 
    c.cloneMutex.RLock()
    for _, cloneWrapper := range c.clones {
        // This calls refreshCloneMetadata for EACH clone
        c.refreshCloneMetadata(cloneWrapper)
        clones = append(clones, cloneWrapper.Clone)
    }
    c.cloneMutex.RUnlock()
    // ...
}

The Chain of Expensive Operations

For each clone, refreshCloneMetadata() is called
This calls c.provision.GetSessionState(w.Session, w.Clone.Branch, w.Clone.ID)
Which calls fsm.GetSessionState(branch, cloneID) in the ZFS manager
This executes a full ZFS filesystem scan via m.listFilesystems(m.config.Pool.Name)

The ZFS command executed is:

zfs list -po name,used,mountpoint,compressratio,available,type,origin,creation,referenced,logicalreferenced,logicalused,usedbysnapshots,usedbychildren,dblab:datastateat,dle:branch,dle:message -S dblab:datastateat -S creation -t filesystem -r <pool_name>

Why It's Slow

Multiple Custom Properties: The command queries 16 fields including custom ZFS properties (dblab:datastateat, dle:branch, dle:message)
Full Pool Scan: Each call scans the entire ZFS pool recursively (-r flag)
N+1 Pattern: If you have N clones, this expensive ZFS command runs N times
Linear Search: After getting all filesystems, it linearly searches for the specific clone

Performance Impact

1 clone: ~400ms per ZFS command = ~400ms total
10 clones: ~400ms × 10 = ~4 seconds total
100 clones: ~400ms × 100 = ~40 seconds total

Solution Recommendations

1. Batch ZFS Queries (Recommended)

Cache the ZFS filesystem list and reuse it for all clones in a single request:

// In GetClones(), call listFilesystems once and pass the result
entries, err := m.listFilesystems(m.config.Pool.Name)
if err != nil {
    return nil, err
}

// Pass entries to each refreshCloneMetadata call
for _, cloneWrapper := range c.clones {
    c.refreshCloneMetadataWithCache(cloneWrapper, entries)
}

2. Optimize ZFS Query

Remove unnecessary fields from the ZFS query if not needed for clone listing
Consider using ZFS native libraries instead of command-line calls
Add indexes on custom properties if ZFS supports it

3. Implement Caching

Cache filesystem state for a short period (e.g., 30 seconds)
Invalidate cache only when clones are created/destroyed

4. Lazy Loading

Don't refresh metadata for clone list view
Only refresh when individual clone details are requested

5. Background Updates

Update clone metadata in a background goroutine
Serve cached data for list requests

Quick Fix Impact

Implementing solution #1 (closed) (batching) would reduce the time from 4 seconds to ~400ms for 10 clones - a 90% improvement.

The core issue is that the current architecture treats each clone metadata refresh as an independent operation, when they could all be satisfied with a single ZFS query.

Edited Aug 26, 2025 by Denis Morozov