Improved file format for omwaddon files
This is a follow-up to the idea that was originally discussed here: https://forum.openmw.org/viewtopic.php?f=3&t=6816.
In order to streamline the development of game data/mods, I'm proposing that we create a new format for plugin files to replace the current format (support for which should obviously be maintained, but the point is that this will be a newer, better format). This change is not to add to the information that is supported by the files, but rather to improve their structure to address the issues with the format itself.
My proof of concept, first brought up in the aforementioned thread, is called DeltaPlugin. It has come along significantly since then, and I've also used it to create a project for the abandoned Quill of Feyfolken mod, to demonstrate this sort of plugin (and how OpenMW mods can better integrate with version control). Note that the proof-of-concept tool does not have its own binary representation; it only handles its yaml-based format and esps. It also currently supports only a little under half of all record types. It handled almost all of them at one point, but it had some issues and I've been rewriting it one record at a time; what is supported now is much more stable and correct than before.
This idea has quite a broad scope, so my thought was to keep this issue focused mostly on high-level ideas, and on the serialization system, though I also have many more specific ideas.
Goals
To provide a data format which:
1. doesn't require "merged patches".
2. has both a binary representation and at least one official text representation.
3. information from all of Morrowind/Oblivion/Skyrim plugins can be serialized into.
4. works hand-in-hand with localization systems.
(1.) is what I originally brought up in the forum thread. In short, the idea is not only to have plugins override as little information as possible, but also to provide other ways of specifying modifications which are as non-destructive as possible (e.g. inserting into/removing from collections).
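To make the idea of non-destructive modifications concrete, here is a minimal sketch in Python. The delta shape (`set`, `insert`, `remove`) and the function name `apply_delta` are assumptions made purely for illustration; they are not DeltaPlugin's actual schema.

```python
from copy import deepcopy


def apply_delta(record, delta):
    """Return a copy of `record` with `delta` applied.

    Hypothetical delta shape (illustration only):
      {"set": {field: value},
       "insert": {field: [items]},
       "remove": {field: [items]}}
    """
    result = deepcopy(record)
    # Plain overrides: replace the whole field value.
    for field, value in delta.get("set", {}).items():
        result[field] = value
    # Collection inserts: add items without touching existing ones.
    for field, items in delta.get("insert", {}).items():
        target = result.setdefault(field, [])
        for item in items:
            if item not in target:
                target.append(item)
    # Collection removals: drop only the named items.
    for field, items in delta.get("remove", {}).items():
        result[field] = [x for x in result.get(field, []) if x not in items]
    return result


base = {"name": "Fargoth", "level": 1, "items": ["ring", "torch"]}
plugin_a = {"insert": {"items": ["potion"]}}
plugin_b = {"set": {"level": 5}, "remove": {"items": ["torch"]}}

merged = base
for delta in (plugin_a, plugin_b):
    merged = apply_delta(merged, delta)

print(merged)
# {'name': 'Fargoth', 'level': 5, 'items': ['ring', 'potion']}
```

Because neither plugin replaces the whole `items` list, both changes survive regardless of load order, which is the situation a merged patch would otherwise be needed for.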
(2.) is important for interoperability. Even using a standard storage format for the information, the binary format will likely be very difficult to interact with. On the other hand if there exists an equivalent way of storing it in a text-based data format such as json (or its variants or derivatives), then you could not only edit the files in a text editor, but there would be tools in basically any programming language which could handle the files with ease. It also allows plugins to be stored in version control, and commented (sadly, not in canonical json, but some json parsers support comments, and there are also variants like hjson, json5 and yaml which support comments).
(3.) is obviously not an immediate goal; the idea is not to support Oblivion/Skyrim records right away, but rather to design a format which won't be structurally incompatible with information from the later games. This is both for looking forwards to supporting the data from the other games, as well as a reminder that we don't need to be constrained by the design decisions of the original Morrowind team. There are many ways of storing the information in Morrowind plugins, and they don't need to structurally resemble the originals to be able to support translation from the original plugin format.
(4.) is important since it would be more work to introduce later. Localized messages usually have a more complex format than what might be supported by a non-localized system, so having in-game displayed text stored by default in localization files alongside the plugin may be the simplest option, though not the only one. By the sounds of !620, localization will likely be using ICU MessageFormat via ICU4C.
Serialization System
While the existing esm structure isn't necessarily incompatible with these ideas, its specification is fairly limited, particularly when it comes to encoding complex structures.
I initially looked at CapnProto and FlatBuffers, however they are not modifiable at runtime due to having the same in-memory and on-disk format (you can change scalars in-place, but complex structures can't be modified), which is a major issue as it prevents us from having plugins modify records introduced by previous plugins (short of parsing those structures into a different internal format at runtime, which is probably more trouble than it's worth). Instead I'm recommending Protocol Buffers (a.k.a. protobuf), an older format, which has its quirks, but generally seems suitable.
The benefits of protobuf (and similar systems) include:
- Language independent structure definitions which can be used to generate code for any of a number of languages.
- Well-defined safe update mechanisms (see here for protobuf).
- Built-in support for serialization to both binary and text-based formats
Limitations of Protocol Buffers
A significant issue with Protocol Buffers is that it doesn't support generics, so creating structures to use for having plugins modify structures in previously defined records will be much more verbose, as separate definitions will be needed for each type (e.g. most fields will need a union { value: V, deleted: void }).
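As an illustration of what a generic "value or deleted" wrapper buys, here is a minimal Python sketch. The names `FieldDelta`, `of`, `removed` and `apply_field_deltas` are hypothetical. Python is used only to show the shape of the idea; without generics, a schema would need one such definition per field type.

```python
from dataclasses import dataclass
from typing import Dict, Generic, Optional, TypeVar

V = TypeVar("V")


@dataclass(frozen=True)
class FieldDelta(Generic[V]):
    """Either a replacement value or a deletion marker (hypothetical)."""
    value: Optional[V] = None
    deleted: bool = False

    @classmethod
    def of(cls, value: V) -> "FieldDelta[V]":
        return cls(value=value)

    @classmethod
    def removed(cls) -> "FieldDelta[V]":
        return cls(deleted=True)


def apply_field_deltas(record: Dict[str, object],
                       deltas: Dict[str, FieldDelta]) -> Dict[str, object]:
    """Apply per-field deltas; fields with no delta are left unchanged."""
    result = dict(record)
    for name, delta in deltas.items():
        if delta.deleted:
            result.pop(name, None)
        else:
            result[name] = delta.value
    return result


record = {"name": "Iron Dagger", "enchantment": "fire_1", "weight": 3.0}
deltas = {
    "weight": FieldDelta.of(2.5),          # one wrapper works for a float...
    "enchantment": FieldDelta.removed(),   # ...and for a string field
}
patched = apply_field_deltas(record, deltas)
print(patched)
# {'name': 'Iron Dagger', 'weight': 2.5}
```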
Unfortunately the only data serialization systems I've seen which support generics are CapnProto, and Bond. CapnProto I've touched on already. Bond on the other hand lacks unions, and while it seems to support polymorphism, I'm not convinced that would work well for our use cases, and the text encodings (json and xml) don't support polymorphism anyway (see here).
Protocol Buffers also doesn't allow custom default values in the proto3 version of the syntax, and while they are supported in proto2, proto2 doesn't support explicit omission of fields (omitted fields are just treated as their default value).
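The reason explicit omission matters for override plugins can be shown with a small Python sketch. It does not model any particular protobuf version; the field names, the `DEFAULTS` table and the function names are all assumptions chosen for illustration.

```python
# Hypothetical schema defaults for an item record.
DEFAULTS = {"name": "", "weight": 0.0, "value": 0}


def decode_without_presence(encoded):
    """Decoder with no presence tracking: omitted fields become defaults."""
    return {field: encoded.get(field, default)
            for field, default in DEFAULTS.items()}


def merge_without_presence(base, override_encoded):
    # "Set to 0" and "not mentioned" are indistinguishable after decoding,
    # so every field of the override clobbers the base record.
    return {**base, **decode_without_presence(override_encoded)}


def merge_with_presence(base, override_encoded):
    # Only the fields that were actually present are applied.
    return {**base, **override_encoded}


base = {"name": "Iron Dagger", "weight": 3.0, "value": 10}
override = {"value": 0}  # the plugin only wants to make the item worthless

lossy = merge_without_presence(base, override)
exact = merge_with_presence(base, override)

print(lossy)  # {'name': '', 'weight': 0.0, 'value': 0}
print(exact)  # {'name': 'Iron Dagger', 'weight': 3.0, 'value': 0}
```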
There also doesn't seem to be anything that would work nicely as a bit field (short of just using an integer and parsing it manually). CapnProto, with its method of always serializing every field, is able to pack booleans together so that up to 8 can be stored together in a byte, so it gets closest as you could just have a collection of booleans and they would be just as compact as a bit field. Unfortunately, other systems also lack this feature, so likely the best option would be to just use integers and also provide a custom tool for doing the json conversion which prettifies the bit fields by associating them with an enum and using the names in the enum to represent the bits.
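A rough sketch of the "prettified" bit field conversion described above, using Python's standard `enum.IntFlag`. The flag names and bit positions are placeholders for illustration, not a proposed schema, and the helper names are hypothetical.

```python
import json
from enum import IntFlag


class NpcFlags(IntFlag):
    # Placeholder names and bit positions, for illustration only.
    FEMALE = 0x01
    ESSENTIAL = 0x02
    RESPAWN = 0x04
    AUTO_CALC = 0x10


def flags_to_names(value, flag_type):
    """Turn the stored integer into a list of flag names for the text form."""
    return [flag.name for flag in flag_type if value & flag.value]


def names_to_flags(names, flag_type):
    """Turn a list of flag names back into the stored integer."""
    value = 0
    for name in names:
        value |= flag_type[name].value
    return value


raw = 0x13  # the integer as it would be stored in the binary form
text = json.dumps({"flags": flags_to_names(raw, NpcFlags)})
print(text)
# {"flags": ["FEMALE", "ESSENTIAL", "AUTO_CALC"]}

round_tripped = names_to_flags(json.loads(text)["flags"], NpcFlags)
print(round_tripped == raw)
# True
```

A real converter would also need a policy for bits that have no name in the enum; this sketch silently drops them.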
Other Options
There is another option for CapnProto (which I previously said was unsuitable due to mutability concerns): a tool designed to be used alongside it called podgen, which allows CapnProto to be used as the serialization format while still serializing into (mutable) standard C++ structures. While podgen is experimental and likely not stable, this wouldn't affect the plugin format itself, as that would still use just CapnProto. There's also a plan, from CapnProto's original author, for CapnProto to support a "Plain Old C Structs" variant, which would allow mutation at the cost of losing zero-copy serialization. At the same time, CapnProto's binary format is less efficient at storing optional values than protobuf, as every field is always serialized. An Option type could easily be included (since CapnProto has generics), but as far as I can tell it would end up taking up more space. By contrast, declaring a field optional in a protobuf message gives semantic meaning to its omission, rather than having it just take the default value, and doesn't require any changes to how the field is serialized.
There's also Apache Thrift, which is quite similar to protobuf.
It supports default values in addition to optional fields (which are implemented similarly to protocol buffers, in that they aren't serialized if omitted, but the only way of testing this at runtime is via a presumably unstable __is_set field, where protobuf has public functions defined in the API to test the presence of optionals).
On the other hand, compared to protobuf it lacks nesting enums/structs inside other structs, and its json serialization is much less human-readable (it uses field numbers instead of their names). Its documentation also seems generally worse than protobuf's.
Thrift produces smaller generated code, as it basically maps the types into standard library types which are exposed directly to the software, while protobuf hides the implementation behind functions.