Internationalisation of Mods
Mentioned in !1245 (closed), among other places.
This will affect the engine in a few ways:
- The plugin format. If we want fluent messages to be embeddable in the plugin itself as well as in localisation files, a few changes to the plugin format will be necessary (though !833 (closed) is an alternative that would make certain things simpler).
- A mapping will need to be created between plugins and .ftl files (which store the localisations).
- Rules need to be created to determine how localisations of plugins are loaded.
- Tools should be created for transitioning plugins from the old version to the new version.
1. New Plugin
The following changes will be necessary for the new plugin version.
Encoding
All strings should be encoded as utf-8.
It doesn't strictly need to be utf-8 and could be another Unicode encoding. However, utf-8 matches both openmw's internal encoding and fluent's recommended encoding, and it's simpler to use just one encoding than to simultaneously support utf-16, etc.
Localisable Identifiers
Records with localisable identifiers (i.e. those whose identifiers are shown in-game) will need to have a new subrecord introduced to hold the localisation.
If I'm not missing any, this is limited to:
- Dialogue
- Interior Cells
It may also be useful to modify Skill and MagicEffect, as those currently store their localised name in GMSTs, and this could be a good time to change that.
All of these could use the FNAM subrecord, otherwise unused in those records, to follow existing conventions.
Unfortunately this won't improve compatibility with existing localised plugins, as they would be using the non-English identifiers, but while we can't exactly redistribute fixed plugins, we could create a tool (or a patch) which updates them to use the English versions of the identifiers.
2. Fluent File Mapping
This defines how fields in a record map to messages which can be defined in .ftl files. Setting up such a mapping allows us to localise existing mods without having to change them significantly (e.g. by replacing the text in records with message identifiers, something which would be more complicated for modders).
Using this mapping along with embedded messages also means that it's not necessary to deal with a fallback language for each plugin. Instead, we can always fall back to the embedded message.
Identifiers
Project fluent's identifiers have a more limited range of characters than esp identifiers: [a-zA-Z][a-zA-Z0-9_-]*. Fluent's identifiers are also case sensitive.
We could set up a fixed mapping between morrowind identifiers and fluent identifiers. E.g. make all fluent identifiers lowercase to handle case differences, translate spaces and special characters into hyphens, replace accented characters (in the case of plugins not developed in English) with their non-accented versions, and remove apostrophes (since that looks mildly better than replacing them with hyphens). I think this shouldn't cause any collisions in the Morrowind game data at least, though that should be double-checked.
We could also require that plugins in the new format use this new, more restricted identifier format, though it's not necessary.
It will also be necessary, for at least certain records which share names such as Script/Startup and interior/exterior PathGrid/Cell, to prefix identifiers with the name of the record type (ideally the formatted name, not the four-character code, for legibility). It would not be a bad idea to do this for all records to help ensure the identifiers are unique, and also that they don't begin with numbers (e.g. INFO records).
Examples of this mapping follow in the next section.
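As a rough illustration of the fixed mapping described above, the following Python sketch prefixes the record type, strips accents, lowercases, drops apostrophes and turns every remaining disallowed character into a hyphen. The function name to_fluent_id and the record type names passed to it are hypothetical, and the sketch does nothing to detect collisions between mapped identifiers.

```python
import re
import unicodedata


def to_fluent_id(record_type: str, identifier: str) -> str:
    """Map a record type and record identifier to a fluent-style identifier.

    Hypothetical helper: assumes record_type starts with an ASCII letter,
    so the result always matches [a-zA-Z][a-zA-Z0-9_-]*.
    """
    text = f"{record_type}-{identifier}"
    # Decompose accented characters and drop the combining marks (é -> e).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Fluent identifiers are case sensitive, so normalise to lowercase.
    text = text.lower()
    # Remove apostrophes (straight and typographic) rather than hyphenating them.
    text = text.replace("'", "").replace("\u2019", "")
    # Everything else outside the allowed set becomes a hyphen.
    return re.sub(r"[^a-z0-9_-]", "-", text)


print(to_fluent_id("GameSetting", "sMonthMorningStar"))
print(to_fluent_id("Race", "High Elf"))
```

Characters that have no decomposition (such as ß) simply become hyphens in this sketch, which is one of the places where collisions would need checking.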
Fields
Many records only have one localisable field, in which case that would be the contents of the message. E.g.
gamesetting-smonthmorningstar = Morning Star
For those with multiple, the field names will need to be mapped to attributes. I think most records with multiple localisable fields include a localisable name, which should be the main message value when possible. Exactly which field is not an attribute (if any; messages can consist of nothing but attributes) will likely need to be decided on a per-record basis.
E.g.
race-high-elf = High Elf
    .description = High Elf description which I'm not writing out in full here...
Parameterised Messages
The existing %-style formatting isn't directly compatible with fluent's message parameters: many of the existing parameters are positional, while fluent's parameters require names. One reasonable option is to map them to semi-numerical names representing positions (e.g. { $v0 }, { $v1 }, etc.; the v is necessary since fluent identifiers must start with a letter).
Parameters such as those in books and Info record text are easier: %PCName becomes { $PCName }, etc.
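The two conversions above could be sketched in Python roughly as follows. The function names are hypothetical, the set of recognised conversion letters (s, d, i, f, g) is an assumption about what the game data uses, and escaping of literal braces in the source text is not handled.

```python
import re

# printf-style specifiers such as %s, %d, %.2f, plus the literal %% escape.
# Assumption: only the conversions s, d, i, f and g need to be recognised.
_POSITIONAL = re.compile(r"%(?:%|[-+#0]*\d*(?:\.\d+)?[sdifg])")

# Named variables as used in book and Info text, e.g. %PCName.
_NAMED = re.compile(r"%([A-Za-z][A-Za-z0-9_]*)")


def positional_to_fluent(text: str) -> str:
    """Replace positional specifiers with { $v0 }, { $v1 }, ... in order."""
    index = 0

    def replace(match: re.Match) -> str:
        nonlocal index
        if match.group(0) == "%%":
            return "%"  # an escaped percent sign is not a parameter
        placeable = f"{{ $v{index} }}"
        index += 1
        return placeable

    return _POSITIONAL.sub(replace, text)


def named_to_fluent(text: str) -> str:
    """Replace %Name style variables with { $Name } placeables."""
    return _NAMED.sub(r"{ $\1 }", text)


print(positional_to_fluent("That %s is mine"))
print(named_to_fluent("Hello, %PCName."))
```

Note that this drops any width or precision information from the original specifier; if that matters, it would have to be carried over some other way, for example through a fluent function call in the message.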
Scripts
Unlike most records, I don't think it would be a good idea to try to internationalise scripts automatically. Generating identifiers for hardcoded strings in scripts won't be reliable, in that a converted Morrowind.esm might not use the same identifiers as a mod which modifies the strings used in the scripts (and once again, the existing localisations cause a particular problem here).
It might be better to leave existing scripts unchanged and introduce a localisation function to the Lua API so that new scripts can make use of the system.
Unfortunately, unlike records, where we can simultaneously embed messages and have a reliable identifier for use with the localisation system (due to records having unique identifiers and fixed fields), this will need to be opt-in only, in that it's not possible to write a third-party localisation for scripted text in a mod which hardcodes strings in the scripts instead of using the localisation function.
Singular/Plural Gamesettings
I think this only applies to sNotifyMessage61/62 and sNotifyMessage63/64, which are the singular and plural forms of two different messages.
Since some languages have more than just singular and plural forms, these should be combined into a single message which can make use of fluent's select expressions to determine the behaviour.
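To make the combined message concrete, here is a small Python sketch that builds one fluent message with a select expression from a singular and a plural text. The function name, the default variable name v0 and the placeholder texts are all hypothetical; the real contents of the gamesettings are not reproduced here.

```python
def plural_message(message_id: str, singular: str, plural: str,
                   count_var: str = "v0") -> str:
    """Build a fluent message that selects on the plural category of a count.

    Only the 'one' and 'other' categories are emitted; a translator for a
    language with more categories (few, many, ...) would add more variants.
    """
    lines = [
        f"{message_id} =",
        f"    {{ ${count_var} ->",
        f"        [one] {singular}",
        f"       *[other] {plural}",  # '*' marks the default variant
        "    }",
    ]
    return "\n".join(lines) + "\n"


# Placeholder texts; the identifier assumes the lowercase, prefixed mapping.
print(plural_message("gamesetting-snotifymessage61",
                     "SINGULAR TEXT", "PLURAL TEXT"))
```

Because the selector is a number, fluent chooses the variant using the plural rules of the active locale, which is what allows a single message to cover languages with more than two forms.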
3. Data Files
Example structure:
Data Files
└── l10n
    ├── en
    │   ├── Morrowind.esm.ftl
    │   └── textures
    │       └── c_nordic01_shirt_m.dds
    ├── en-GB
    │   └── Morrowind.esm.ftl
    └── fr
        ├── Morrowind.esm.ftl
        └── textures
            └── c_nordic01_shirt_m.dds
I propose that, following fluent conventions, l10n/{localeid} directories can be created as directories in the VFS. Each plugin may have a corresponding localisation file, which will be loaded in the same order as the plugins. Any messages not found in the localisation files for either the default or fallback locales will be fetched from the plugin directly.
Note that in case some combination of PLUGIN.esm, PLUGIN.esp and PLUGIN.omwaddon are present together, it's necessary to include both the original extension and the .ftl extension in the file name. It might be possible to make the original extension optional, so that both Morrowind.esm.ftl and Morrowind.ftl are considered valid localisation files for Morrowind.esm.
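The lookup rules described so far might be sketched in Python like this. Both function names and the dictionary-based data structures are hypothetical stand-ins for whatever the VFS and plugin loader actually provide, and load order between plugins is not modelled.

```python
from pathlib import PurePosixPath


def candidate_ftl_paths(plugin_name: str, locales: list[str]) -> list[str]:
    """List the localisation files to try for a plugin, most preferred first.

    For each locale the full name (Morrowind.esm.ftl) is tried before the
    optional short form (Morrowind.ftl).
    """
    stem = PurePosixPath(plugin_name).stem
    paths = []
    for locale in locales:
        base = PurePosixPath("l10n") / locale
        paths.append(str(base / f"{plugin_name}.ftl"))
        if stem != plugin_name:
            paths.append(str(base / f"{stem}.ftl"))
    return paths


def resolve_message(message_id: str,
                    locale_messages: dict[str, dict[str, str]],
                    locales: list[str],
                    embedded: dict[str, str]) -> str:
    """Look a message up in each requested locale, then in the plugin itself."""
    for locale in locales:
        message = locale_messages.get(locale, {}).get(message_id)
        if message is not None:
            return message
    # Nothing found in any .ftl file: fall back to the embedded source text.
    return embedded[message_id]


print(candidate_ftl_paths("Morrowind.esm", ["en-GB", "en"]))
```

If the short form is allowed, something would still need to decide which plugin Morrowind.ftl belongs to when both Morrowind.esm and a Morrowind.esp are loaded; the sketch ignores that case.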
I also suggest allowing localised assets to be placed inside the localisation directories, since textures, video and possibly even meshes can be localisable. This may make loading more complicated if we want the asset lowest in the load order, whether localised or not, to be used. On the other hand, if we want to prefer localised assets when they are available, then it's just a matter of looking in the localisation directories for the requested locales before falling back to the default. We could also recommend that localisable assets always be stored in the localisation directories, though unless we want to specify a fallback locale for each data directory, they would need to be present both in the appropriate localisation directory and in the regular tree (which would be the fallback).
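The simpler of the two options, always preferring a localised asset when one exists, could look roughly like the following Python sketch. The function name and the use of a plain set of paths in place of the VFS are assumptions, and the case-insensitive matching a real VFS would need is left out.

```python
from typing import Optional


def resolve_asset(path: str, locales: list[str],
                  available: set[str]) -> Optional[str]:
    """Return the localised variant of an asset if any requested locale has one.

    'available' stands in for the merged VFS contents; paths are compared
    exactly, which a real implementation would not do.
    """
    for locale in locales:
        localised = f"l10n/{locale}/{path}"
        if localised in available:
            return localised
    # No localised copy: use the asset from the regular tree, if present.
    return path if path in available else None


files = {
    "l10n/en/textures/c_nordic01_shirt_m.dds",
    "l10n/fr/textures/c_nordic01_shirt_m.dds",
    "textures/c_nordic01_shirt_m.dds",
}
print(resolve_asset("textures/c_nordic01_shirt_m.dds", ["fr", "en"], files))
```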
Note that any mods only adding localisations would need to create a dummy plugin, or hijack an existing plugin's name (preferably the former).
4. Infrastructure
Two tools should be created to make the transition easier for modders.
Conversion tool
For the most part this should be fairly simple. I'm not sure if there's a good way to automatically convert the split singular/plural gamesettings, though since the grammar of the plural forms in Morrowind.esm is rather poor anyway, I think we could probably get away with a fairly naive substitution like sticking with the plural form for both (I mean, 1 Iron Dagger has... is more correct than 2 Iron Dagger has...), and displaying a warning that these localisations in particular should be manually corrected.
Localisation extraction tool
A tool which can take a plugin and create a .ftl file (fluent localisation file) containing the source text which was embedded in the plugin.
Each record would need a function that uses the identifier mapping to produce a list of fluent::ast::Message, which could then be collected into a file with the messages from the other records (not currently fully supported, but one goal for fluent-cpp is to be able to turn the AST back into a valid .ftl resource).
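As a rough picture of what the extraction output would look like, this Python sketch renders already-mapped records directly as .ftl text. It stands in for the AST-based approach described above rather than reproducing it: the function names and the tuple layout of a record are hypothetical, and multi-line values (which would need indentation) are not handled.

```python
from typing import Optional


def render_message(message_id: str, value: Optional[str] = None,
                   attributes: Optional[dict[str, str]] = None) -> str:
    """Render one message; the value may be absent if only attributes exist."""
    first = f"{message_id} = {value}" if value is not None else f"{message_id} ="
    lines = [first]
    for name, text in (attributes or {}).items():
        # Attributes belong to the preceding message and must be indented.
        lines.append(f"    .{name} = {text}")
    return "\n".join(lines)


def extract_ftl(records: list[tuple]) -> str:
    """Join the messages for all records into the text of a single .ftl file."""
    return "\n".join(render_message(*record) for record in records) + "\n"


print(extract_ftl([
    ("gamesetting-smonthmorningstar", "Morning Star", None),
    ("race-high-elf", "High Elf", {"description": "High Elf description"}),
]))
```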
Context and relationships between mods
This is where things get complicated. A careful design may be able to ensure that relevant context is available for translators to use so that the results are correct, while still allowing localisations from mods to interact nicely with each other, but it's much easier to accidentally create a system which is fundamentally flawed.
What is generally considered best practice for messages containing highly variable data (e.g. something from a large list of objects) is to have one message for each possible value of the variable (see the fluent wiki article Good Practices for Developers).
E.g. for the GMST That %s is mine, there may be context, such as gender, for each of the objects the message may be applied to which affects the rest of the phrase. Also note that the first person singular may vary in some languages, so mine may depend on context. That is fixed context, however, as the possible values are known at the time of localisation (unless the first person singular is dependent on the object, where it gets more complicated), unlike the possible objects (given that mods can add new ones), so it can be handled with an inline select expression.
Unfortunately, having a separate message for each object will be problematic in a modular system, as any mod adding such a phrase would need it to be supported by all objects from all mods which might be used alongside it, otherwise it would simply be unavailable, even in the source language, for that object.
As the required context for a referenced object will vary depending on the language, it also wouldn't really be a great solution to have different versions of an object's name depending on the context.
For the moment, I think we might have to stick with partial translations, that is, just naively passing the formatted name as an argument and hoping context for it isn't too much of an issue (fortunately also the simplest solution, even if it's not a very good one). Ideally a system which allows localisation-specific APIs for message arguments could be used (basically allowing the localisation access to message attributes on messages passed as arguments, but with the localisation having control over which attributes are used, if any), though there are also downsides to that option, and it's not something currently supported by fluent (see fluent#80).