# Proposal: new `.sia` format
The upcoming renter overhaul necessitates an update to how we store file metadata. The goals for the new format are:
- High throughput for reading and writing
- Small filesize
- Provide metadata expected of traditional files (size, mode bits, etc.)
- Support filesharing
- Improve renter scalability
- Easy to work with and build tools around
Non-goals:
- Be extremely flexible/generic
- Be human-readable
- Scale past 10TB filesize or 1 million `.sia` files
As a recap, the current `.sia` format is a flat file encoded using the Sia encoding protocol. It consists of a handful of metadata fields, followed by an array of `fileContract` objects:
```go
type fileContract struct {
	ID          types.FileContractID
	IP          modules.NetAddress
	Pieces      []pieceData
	WindowStart types.BlockHeight
}

type pieceData struct {
	Chunk      uint64
	Piece      uint64
	MerkleRoot crypto.Hash
}
```
The format has a number of shortcomings. It uses a contract ID and `NetAddress` to identify a host and contract, instead of the host's public key. It provides no means for addressing less than a full sector, making partial sector downloads impossible. The entire file must be read into memory in order to manipulate it, and indeed the renter currently reads every `.sia` file into memory on startup. Lastly, decoding the format requires the use of a non-standard encoding protocol.
I have been thinking about a new format for a while, and I think I have a decent design ready. If nothing else, I hope we can use it as a solid base to build off of or take ideas from.
## Specification
The new `.sia` format is a gzipped tar archive. The name of the `.sia` file is assumed to be the name of the original file. The first file in the archive is called the Index, and subsequent files are called Blobs. In the tar format, the name of the Index is `index` and the name of each Blob is the host public key that it is associated with, as formatted by the `types.SiaPublicKey.String()` method. Implementations must preserve these names when extracting and archiving `.sia` files. The tar header of each file in the archive should be otherwise ignored.
Blobs are binary files that represent raw data. They contain only the information necessary to identify a set of bytes stored on a particular host. Each Blob is a flat array of entries, with each entry uniquely identifying a contiguous slice of sector data. The entry object contains a sector Merkle root, an offset within the sector, and a length. The order of the array is significant. In Go syntax, the definition of a Blob is:
```go
type Blob []BlobEntry

type BlobEntry struct {
	MerkleRoot [32]byte
	Offset     uint32
	Length     uint32
}
```
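Because each `BlobEntry` is fixed-size (32 + 4 + 4 = 40 bytes), Blob files can be read and written with plain offset arithmetic. A rough sketch of the serialization (the byte order is my assumption here; the spec above does not pin one down):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// BlobEntry identifies a contiguous slice of sector data on a host.
type BlobEntry struct {
	MerkleRoot [32]byte
	Offset     uint32
	Length     uint32
}

// entrySize is the fixed on-disk size of a BlobEntry: 32 + 4 + 4 bytes.
const entrySize = 40

// marshalEntry encodes a BlobEntry into its 40-byte on-disk form.
// Little-endian is an assumption, not part of the spec.
func marshalEntry(e BlobEntry) []byte {
	buf := make([]byte, entrySize)
	copy(buf[:32], e.MerkleRoot[:])
	binary.LittleEndian.PutUint32(buf[32:36], e.Offset)
	binary.LittleEndian.PutUint32(buf[36:40], e.Length)
	return buf
}

// unmarshalEntry decodes a 40-byte slice back into a BlobEntry.
func unmarshalEntry(buf []byte) BlobEntry {
	var e BlobEntry
	copy(e.MerkleRoot[:], buf[:32])
	e.Offset = binary.LittleEndian.Uint32(buf[32:36])
	e.Length = binary.LittleEndian.Uint32(buf[36:40])
	return e
}

func main() {
	e := BlobEntry{Offset: 64, Length: 128}
	e.MerkleRoot[0] = 0xab
	// BlobEntry contains only comparable fields, so == checks the roundtrip.
	fmt.Println(unmarshalEntry(marshalEntry(e)) == e)
}
```

The fixed entry size is what makes the "entry j lives at offset j * sizeof(BlobEntry)" arithmetic straightforward.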
An Index is a JSON object that references one or more Blobs and imbues their raw data with file semantics.
```go
type Index struct {
	Version   int
	Filesize  int64       // original file size
	Mode      os.FileMode // original mode bits
	ModTime   time.Time   // time of last modification
	MasterKey [32]byte    // seed from which blob encryption keys are derived
	MinShards int         // number of shards required to recover file
	Blobs     []IndexBlob // len(Blobs) is the total number of shards
}

type IndexBlob struct {
	HostKey types.SiaPublicKey
}
```
## Discussion
The flow of working with the new `.sia` format is as follows: the archive is first extracted to a working directory, wherein each Blob is its own file. These files are then modified as needed. For example, when uploading, `siad` will repeatedly read a chunk of file data, encode it, and spawn a goroutine for each piece that uploads the sector data and appends a `BlobEntry` to the host's corresponding Blob file. In our current terminology, the `BlobEntry` for "piece i, chunk j" is written to the file named `Index.Blobs[i].HostKey.String()` at offset `j * sizeof(BlobEntry)`. When modifications are complete, the Index and Blobs are re-archived and the working directory is deleted. (When downloading, it is not necessary to extract the archive; it can simply be read into memory. Similarly, it is easy to read just the Index without extracting any files or processing any Blob data.)
This design satisfies most of the design goals above. It is simple to work with, and the technologies involved (tar, gzip, JSON) are well-supported in all mainstream languages. Compression brings the filesize close to the ideal. Files can be shared with anyone who has contracts with at least `Index.MinShards` of the hosts. The Index of a file can be stored in memory while leaving the Blobs on disk, allowing the renter to quickly report the metadata of any file without excessive RAM usage or disk I/O. Performance-wise, the design allows each Blob file to be written in parallel; and since they are separate files, programmers can freely append to them without worrying about overwriting some other part of the format (as would be the case if the `.sia` format were a flat file). This also means that the metadata lives on disk rather than in RAM, which reduces memory requirements when operating on large files (or many files concurrently). Also note that it is not strictly necessary to `fsync` the Blob files, because an uncommitted write merely results in one or more `BlobEntry`s being lost, i.e. the data referenced by those entries will have to be re-uploaded. (The `.sia` file, on the other hand, should be `fsync`'d before the working directory is deleted.)
The Index and Blob types are analogous to Unix's "inode" and "block" data structures. Blobs are intended to be fairly generic; no assumptions should be made about the "shape" of the data they reference. For example, the format does not specify that a blob may only reference a given sector Merkle root once, nor does it specify that `BlobEntry`s may not reference overlapping slices of a sector. The semantics of a set of Blobs are defined by their Index, which contains enough metadata to satisfy the `os.FileInfo` interface. (This will make it easier to write things like FUSE layers and similar abstractions.) Since the Index is JSON-encoded and has a `Version` field, it is easier to extend than the `Blob` type. However, it is not a completely generic object: the specific encryption and redundancy algorithms are implied by the version number, rather than being explicitly configurable.
## Shortcomings and possible improvements
Extracting each `.sia` file into a working directory may strain the filesystem if too many files are being concurrently modified. The exact number likely varies by filesystem. It also means that an ugly directory is left behind if the process is killed before it can clean up. However, in the best-case scenario, this means that a user can recover the `.sia` file simply by running `tar -cz workdir`.
The raw nature of the Blob type makes certain calculations difficult. For example, given an offset in the original file, locating the corresponding offset in a Blob is an O(n) operation: each `Length` field must be added until the sum reaches the offset. Compare this to the current format, where all pieces are assumed to be the same size; this means that an offset calculation is a simple division by the piece size. Another important calculation is determining the upload progress of a given file. Implemented naively, this would be an O(n) operation as well. We will have to employ a caching strategy to keep such calculations fast.
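The O(n) walk described above might look like the following sketch (`locate` is a name invented for illustration, not an existing API):

```go
package main

import "fmt"

// BlobEntry as defined in the Specification; only Length matters here.
type BlobEntry struct {
	MerkleRoot [32]byte
	Offset     uint32
	Length     uint32
}

// locate maps an offset within the data covered by a Blob to the index
// of the entry containing it, plus the remaining offset within that
// entry. It sums Length fields until the running total passes the
// target offset -- the O(n) scan described above.
func locate(blob []BlobEntry, off uint64) (int, uint64, bool) {
	for i, e := range blob {
		if off < uint64(e.Length) {
			return i, off, true
		}
		off -= uint64(e.Length)
	}
	return 0, 0, false // offset lies past the end of the blob
}

func main() {
	blob := []BlobEntry{{Length: 100}, {Length: 50}, {Length: 200}}
	i, rem, _ := locate(blob, 120)
	fmt.Println(i, rem) // offset 120 falls 20 bytes into entry 1
}
```

A cache for this could be as simple as a prefix-sum array of `Length` values, turning the scan into a binary search.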
Some aspects of the file, such as its name and the total number of shards, are encoded implicitly rather than explicitly. Typically it's better to be explicit about such things. The only reason I hesitated with regard to the filename is that it makes it annoying to rename files: if the filename were a field in the Index, changing it would require the full sequence of extracting, decoding, encoding, and archiving, instead of a simple `mv` command. This reveals a more general deficiency of the format, which is that any modification requires extraction and re-archiving. The same would not be true of a flat file with fixed offsets.
I also considered adding a `Checksum` field to the Blob type. Right now, we use AEAD, so a separate checksum would be redundant. However, other implementations might not use AEAD, in which case they would require a separate checksum. Since the Blobs are (eventually) gzipped, an unused `Checksum` field shouldn't increase the final filesize very much. Still, I'd rather avoid adding an unused field unless we feel very strongly that it will pay off in the long run.
EDIT 1: Removed the `Filename` field from the `IndexBlob` type. Instead, the host key will be used to identify the path of the blob file after extraction. However, `IndexBlob` will remain a struct, instead of a bare `types.SiaPublicKey`. This allows us to add fields later without breaking JSON compatibility.
One question here is how exactly we should convert the `types.SiaPublicKey` to a filename. It does have a `String()` method, but the returned string has an algorithm prefix (`ed25519:a0b1c2d3...`). Do we want the prefix in the filename? Does it matter? `:` is legal in POSIX filenames but I'd have to double-check on Windows.
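For what it's worth, if the `:` turns out to be a problem on Windows, one hypothetical workaround is to substitute it when forming the blob filename (the helper name and example key below are made up):

```go
package main

import (
	"fmt"
	"strings"
)

// blobFilename converts a host key string to a filename. The ':' in the
// algorithm prefix is legal in POSIX filenames but reserved on Windows,
// so this hypothetical helper replaces it with '-'.
func blobFilename(hostKey string) string {
	return strings.ReplaceAll(hostKey, ":", "-")
}

func main() {
	fmt.Println(blobFilename("ed25519:0123abcd")) // hypothetical key
}
```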
EDIT 2: Clarified how implementations should treat the tar headers of the Index and Blob files. (discussion)