# Proposal: new `.sia` format
The upcoming renter overhaul necessitates an update to how we store file metadata. The goals for the new format are:
- High throughput for reading and writing
- Small filesize
- Provide metadata expected of traditional files (size, mode bits, etc.)
- Support filesharing
- Improve renter scalability
- Easy to work with and build tools around
Non-goals:
- Be extremely flexible/generic
- Be human-readable
- Scale past 10TB filesize or 1 million `.sia` files
As a recap, the current `.sia` format is a flat file encoded using the Sia encoding protocol. It consists of a handful of metadata fields, followed by an array of `fileContract` objects:
```go
type fileContract struct {
	ID          types.FileContractID
	IP          modules.NetAddress
	Pieces      []pieceData
	WindowStart types.BlockHeight
}

type pieceData struct {
	Chunk      uint64
	Piece      uint64
	MerkleRoot crypto.Hash
}
```
The format has a number of shortcomings. It uses a contract ID and `NetAddress` to identify a host and contract, instead of the host's public key. It provides no means for addressing less than a full sector, making partial sector downloads impossible. The entire file must be read into memory in order to manipulate it, and indeed the renter currently reads every `.sia` file into memory on startup. Lastly, decoding the format requires the use of a non-standard encoding protocol.
I have been thinking about a new format for a while, and I think I have a decent design ready. If nothing else, I hope we can use it as a solid base to build off of or take ideas from.
## Specification
The new `.sia` format is a gzipped tar archive. The name of the `.sia` file is assumed to be the name of the original file. The first file in the archive is called the Index, and subsequent files are called Blobs. In the tar format, the name of the Index is `index` and the name of each Blob is the host public key that it is associated with, as formatted by the `types.SiaPublicKey.String()` method. Implementations must preserve these names when extracting and archiving `.sia` files. The tar header of each file in the archive should be otherwise ignored.
Blobs are binary files that represent raw data. They contain only the information necessary to identify a set of bytes stored on a particular host. Each Blob is a flat array of entries, with each entry uniquely identifying a contiguous slice of sector data. The entry object contains a sector Merkle root, an offset within the sector, and a length. The order of the array is significant. In Go syntax, the definition of a Blob is:
```go
type Blob []BlobEntry

type BlobEntry struct {
	MerkleRoot [32]byte
	Offset     uint32
	Length     uint32
}
```
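Because each `BlobEntry` is fixed-size (32 + 4 + 4 = 40 bytes), Blob files can be read and written with plain offset arithmetic. A rough sketch of the serialization (the byte order is my assumption here; the spec above does not pin one down):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// BlobEntry identifies a contiguous slice of sector data on a host.
type BlobEntry struct {
	MerkleRoot [32]byte
	Offset     uint32
	Length     uint32
}

// entrySize is the fixed on-disk size of a BlobEntry: 32 + 4 + 4 bytes.
const entrySize = 40

// marshalEntry encodes a BlobEntry into its 40-byte on-disk form.
// Little-endian is an assumption, not part of the spec.
func marshalEntry(e BlobEntry) []byte {
	buf := make([]byte, entrySize)
	copy(buf[:32], e.MerkleRoot[:])
	binary.LittleEndian.PutUint32(buf[32:36], e.Offset)
	binary.LittleEndian.PutUint32(buf[36:40], e.Length)
	return buf
}

// unmarshalEntry decodes a 40-byte slice back into a BlobEntry.
func unmarshalEntry(buf []byte) BlobEntry {
	var e BlobEntry
	copy(e.MerkleRoot[:], buf[:32])
	e.Offset = binary.LittleEndian.Uint32(buf[32:36])
	e.Length = binary.LittleEndian.Uint32(buf[36:40])
	return e
}

func main() {
	e := BlobEntry{Offset: 64, Length: 128}
	e.MerkleRoot[0] = 0xab
	// BlobEntry contains only comparable fields, so == checks the roundtrip.
	fmt.Println(unmarshalEntry(marshalEntry(e)) == e)
}
```

The fixed entry size is what makes the "entry j lives at offset j * sizeof(BlobEntry)" arithmetic straightforward.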
An Index is a JSON object that references one or more Blobs and imbues their raw data with file semantics.
```go
type Index struct {
	Version   int
	Filesize  int64       // original file size
	Mode      os.FileMode // original mode bits
	ModTime   time.Time   // time of last modification
	MasterKey [32]byte    // seed from which blob encryption keys are derived
	MinShards int         // number of shards required to recover file
	Blobs     []IndexBlob // len(Blobs) is the total number of shards
}

type IndexBlob struct {
	HostKey types.SiaPublicKey
}
```
## Discussion
The flow of working with the new `.sia` format is as follows: the archive is first extracted to a working directory, wherein each Blob is its own file. These files are then modified as needed. For example, when uploading, `siad` will repeatedly read a chunk of file data, encode it, and spawn a goroutine for each piece that uploads the sector data and appends a `BlobEntry` to the host's corresponding Blob file. In our current terminology, the `BlobEntry` for "piece i, chunk j" is written to the file named `Index.Blobs[i].HostKey.String()` at offset `j * sizeof(BlobEntry)`. When modifications are complete, the Index and Blobs are re-archived and the working directory is deleted. (When downloading, it is not necessary to extract the archive; it can simply be read into memory. Similarly, it is easy to read just the Index without extracting any files or processing any Blob data.)
This design satisfies most of the design goals above. It is simple to work with, and the technologies involved (tar, gzip, JSON) are well-supported in all mainstream languages. Compression brings the filesize close to the ideal. Files can be shared with anyone who has contracts with at least `Index.MinShards` of the hosts. The Index of a file can be stored in memory while leaving the Blobs on disk, allowing the renter to quickly report the metadata of any file without excessive RAM usage or disk I/O. Performance-wise, the design allows each Blob file to be written in parallel; and since they are separate files, programmers can freely append to them without worrying about overwriting some other part of the format (as would be the case if the `.sia` format were a flat file). This also means that the metadata lives on disk rather than in RAM, which reduces memory requirements when operating on large files (or many files concurrently). Also note that it is not strictly necessary to `fsync` the Blob files, because an uncommitted write merely results in one or more `BlobEntry`s being lost, i.e. the data referenced by those entries will have to be re-uploaded. (The `.sia` file, on the other hand, should be `fsync`'d before the working directory is deleted.)
The Index and Blob types are analogous to Unix's "inode" and "block" data structures. Blobs are intended to be fairly generic; no assumptions should be made about the "shape" of the data they reference. For example, the format does not specify that a blob may only reference a given sector Merkle root once, nor does it specify that `BlobEntry`s may not reference overlapping slices of a sector. The semantics of a set of Blobs are defined by their Index, which contains enough metadata to satisfy the `os.FileInfo` interface. (This will make it easier to write things like FUSE layers and similar abstractions.) Since the Index is JSON-encoded and has a `Version` field, it is easier to extend than the `Blob` type. However, it is not a completely generic object: the specific encryption and redundancy algorithms are implied by the version number, rather than being explicitly configurable.
## Shortcomings and possible improvements
Extracting each `.sia` file into a working directory may strain the filesystem if too many files are being concurrently modified. The exact number likely varies by filesystem. It also means that an ugly directory is left behind if the process is killed before it can clean up. However, in the best-case scenario, this means that a user can recover the `.sia` file simply by running `tar -cz workdir`.
The raw nature of the Blob type makes certain calculations difficult. For example, given an offset in the original file, locating the corresponding offset in a Blob is an O(n) operation: each `Length` field must be added until the sum reaches the offset. Compare this to the current format, where all pieces are assumed to be the same size; this means that an offset calculation is a simple division by the piece size. Another important calculation is determining the upload progress of a given file. Implemented naively, this would be an O(n) operation as well. We will have to employ a caching strategy to keep such calculations fast.
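The O(n) walk described above might look like the following sketch (`locate` is a name invented for illustration, not an existing API):

```go
package main

import "fmt"

// BlobEntry as defined in the Specification; only Length matters here.
type BlobEntry struct {
	MerkleRoot [32]byte
	Offset     uint32
	Length     uint32
}

// locate maps an offset within the data covered by a Blob to the index
// of the entry containing it, plus the remaining offset within that
// entry. It sums Length fields until the running total passes the
// target offset -- the O(n) scan described above.
func locate(blob []BlobEntry, off uint64) (int, uint64, bool) {
	for i, e := range blob {
		if off < uint64(e.Length) {
			return i, off, true
		}
		off -= uint64(e.Length)
	}
	return 0, 0, false // offset lies past the end of the blob
}

func main() {
	blob := []BlobEntry{{Length: 100}, {Length: 50}, {Length: 200}}
	i, rem, _ := locate(blob, 120)
	fmt.Println(i, rem) // offset 120 falls 20 bytes into entry 1
}
```

A cache for this could be as simple as a prefix-sum array of `Length` values, turning the scan into a binary search.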
Some aspects of the file, such as its name and the total number of shards, are encoded implicitly rather than explicitly. Typically it's better to be explicit about such things. The only reason I hesitated with regard to the filename is that it makes it annoying to rename files: if the filename were a field in the Index, changing it would require the full sequence of extracting, decoding, encoding, and archiving, instead of a simple `mv` command. This reveals a more general deficiency of the format, which is that any modification requires extraction and re-archiving. The same would not be true of a flat file with fixed offsets.
I also considered adding a `Checksum` field to the Blob type. Right now, we use AEAD, so a separate checksum would be redundant. However, other implementations might not use AEAD, in which case they would require a separate checksum. Since the Blobs are (eventually) gzipped, an unused `Checksum` field shouldn't increase the final filesize very much. Still, I'd rather avoid adding an unused field unless we feel very strongly that it will pay off in the long run.
EDIT 1: Removed the `Filename` field from the `IndexBlob` type. Instead, the host key will be used to identify the path of the blob file after extraction. However, `IndexBlob` will remain a struct, instead of a bare `types.SiaPublicKey`. This allows us to add fields later without breaking JSON compatibility.
One question here is how exactly we should convert the `types.SiaPublicKey` to a filename. It does have a `String()` method, but the returned string has an algorithm prefix (`ed25519:a0b1c2d3...`). Do we want the prefix in the filename? Does it matter? `:` is legal in POSIX filenames but I'd have to double-check on Windows.
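For what it's worth, if the `:` turns out to be a problem on Windows, one hypothetical workaround is to substitute it when forming the blob filename (the helper name and example key below are made up):

```go
package main

import (
	"fmt"
	"strings"
)

// blobFilename converts a host key string to a filename. The ':' in the
// algorithm prefix is legal in POSIX filenames but reserved on Windows,
// so this hypothetical helper replaces it with '-'.
func blobFilename(hostKey string) string {
	return strings.ReplaceAll(hostKey, ":", "-")
}

func main() {
	fmt.Println(blobFilename("ed25519:0123abcd")) // hypothetical key
}
```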
EDIT 2: Clarified how implementations should treat the tar headers of the Index and Blob files. (discussion)