Rework dialogue hypertext parsing (#7297) (!5146) · Merge requests · OpenMW / openmw

The goals were to:

Consistently highlight both explicit and implicit hyperlinks in the journal and the dialogue window (this was previously a privilege of topic learning)
Implement MRK file parsing
Reduce the colossal amount of code duplication between the three parser implementations.

To achieve this, basically everything involving parsing was rewritten (except the implicit keyword search, which seemed fine).

The dialogue system technically supported what I wanted to do with MRK, but there were... issues.

The following expects you to know what implicit links and explicit links are, so consult the tests if this looks very confusing.

Implicit highlighting relies on "seeding" keyword->value pairs into a trie. Highlighting tries to find the longest keyword that covers some part of the text, and returns a token that links it to the expected value.
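To make the seeding and longest-match idea concrete, here is a minimal sketch of a trie-based keyword search. It is not the actual keyword search class: `KeywordTrie`, `Match`, `seed` and `highlight` are hypothetical names, the scan is a simple greedy left-to-right pass, matching is case-insensitive for ASCII only, and word boundaries are ignored.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not OpenMW's keyword search.
struct Match
{
    std::size_t mBeg; // offset of the first matched character
    std::size_t mEnd; // offset one past the last matched character
    std::string mValue; // value seeded for the matched keyword
};

class KeywordTrie
{
    struct Node
    {
        std::map<char, std::unique_ptr<Node>> mChildren;
        std::optional<std::string> mValue; // set if a keyword ends at this node
    };
    Node mRoot;

    // ASCII-only lowercasing (an assumption made for this sketch).
    static char lower(char c) { return static_cast<char>(std::tolower(static_cast<unsigned char>(c))); }

public:
    void seed(std::string_view keyword, std::string value)
    {
        Node* node = &mRoot;
        for (char c : keyword)
        {
            auto& child = node->mChildren[lower(c)];
            if (!child)
                child = std::make_unique<Node>();
            node = child.get();
        }
        node->mValue = std::move(value); // a later seed replaces an earlier one
    }

    // Scans left to right; at each position keeps the longest keyword starting there.
    std::vector<Match> highlight(std::string_view text) const
    {
        std::vector<Match> result;
        std::size_t pos = 0;
        while (pos < text.size())
        {
            const Node* node = &mRoot;
            std::optional<Match> best;
            for (std::size_t i = pos; i < text.size(); ++i)
            {
                auto it = node->mChildren.find(lower(text[i]));
                if (it == node->mChildren.end())
                    break;
                node = it->second.get();
                if (node->mValue)
                    best = Match{ pos, i + 1, *node->mValue };
            }
            if (best)
            {
                pos = best->mEnd; // continue after the matched keyword
                result.push_back(std::move(*best));
            }
            else
                ++pos;
        }
        return result;
    }
};
```

With "guild" and "guild of mages" both seeded, highlighting "Join the Guild of Mages today." yields a single match covering "Guild of Mages", because the longer keyword wins over its prefix.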

There are 3 main things that do hypertext parsing:

Topic learning. The dialogue store seeds all topics into its own keyword search, and the dialogue manager calls parseHyperText, then uses the string tokens it compiles to collect the valid IDs and check whether any highlighted topic can be learned.
Journal entry rendering. It seeds topic ID->topic record pointer pairs into the keyword search. It manually parses the explicit links and generates the plain text representation of the entry in the process. It validates the found hyperlinks against the known topics it seeded into the search (so as to get a valid topic link). If there are no hyperlinks, it falls back to implicit highlighting. One bad part about this, other than the obvious, is that it redoes implicit highlighting every time the record is rendered.
Dialogue rendering. It does basically the same thing, except that it doesn't use the keyword search to determine whether an explicit link is good; it uses the list stored in the window. Which, to be fair, is better, because the journal's keyword search checks aren't exactly O(log(n)).

I don't include tests in this list because, well, they're not particularly relevant here.

The hypertext parser namespace, i.e. the one good implementation, was integrated into the keyword search class so that keyword search could manage almost everything involved. The parser always required an instance of that class to work either way.

To make the tokens easier to use, I wanted to merge the hypertext parser's concept of a token with the keyword search's concept of a match. However, keyword search is a template class: it supports arbitrary "values" (i.e. the ID or topic record link). This is inconvenient because explicit hyperlinks aren't seeded into it, so they don't have a "value". The hypertext parser, meanwhile, only supported text "values", since the useful information it extracted out of explicit hyperlinks was the text between the @# tags.

So I decided to change the meaning of the things that a token contains:

mBeg and mEnd cover the entire segment of text that the link (or rather, the keyword) covers, i.e. for explicit hyperlinks they're the iterators to the @ and to the character after the # (no difference for implicit links). This makes it easier to figure out where the plain text before a link ends and where the plain text after it starts.
mValue is the actual ID, i.e., for implicit links it's not the keyword, but the actual ID relevant to that keyword. This means that the TOP localization part (standard form conversion) happens during hypertext parsing. All the users had to do that either way, so it's more convenient if they get the processed ID.
Explicit/Implicit differentiation. It's inherited from the original hypertext parser (it wasn't included in keyword search tokens, which were implicitly implicit). Outside of the class it shouldn't matter what type they are; it's mainly relevant within it.
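The token layout described above could look roughly like the following sketch. All names other than mBeg, mEnd and mValue are hypothetical, and `normalizeId` is only a stand-in (ASCII lowercasing) for the real standard form conversion, which this example does not attempt to reproduce.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical sketch; names mirror the description, not the actual OpenMW code.
struct Token
{
    enum class Type
    {
        Explicit,
        Implicit
    };

    std::string_view::const_iterator mBeg; // points at '@' for explicit links
    std::string_view::const_iterator mEnd; // points one past '#' for explicit links
    std::string mValue; // the ID the link resolves to, already normalized
    Type mType;
};

// Stand-in for the standard form conversion: lowercases ASCII only (an assumption).
std::string normalizeId(std::string_view text)
{
    std::string id(text);
    for (char& c : id)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return id;
}

// Collects explicit @...# links. Implicit links would come from the keyword search.
std::vector<Token> parseExplicitLinks(std::string_view text)
{
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (true)
    {
        const std::size_t at = text.find('@', pos);
        if (at == std::string_view::npos)
            break;
        const std::size_t hash = text.find('#', at + 1);
        if (hash == std::string_view::npos)
            break; // unterminated link: the rest is treated as plain text
        std::string id = normalizeId(text.substr(at + 1, hash - at - 1));
        tokens.push_back(Token{ text.begin() + static_cast<std::ptrdiff_t>(at),
            text.begin() + static_cast<std::ptrdiff_t>(hash + 1), std::move(id), Token::Type::Explicit });
        pos = hash + 1;
    }
    return tokens;
}
```

Because mBeg and mEnd include the @ and # markers, the plain text between two links is simply the range from one token's mEnd to the next token's mBeg.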

So keyword search is no longer a template class; it always returns string tokens. The reasoning was that all the users needed to parse the ID either way, and getting its pointer equivalent from the journal's and the dialogue window's lists is constant time. Or, well, it is now.

The dialogue store's keyword search was moved into the dialogue manager (the only thing that uses it), as that seemed more appropriate; it's a rather specific application of the parser. The dialogue manager still checks the mod flag from the dialogue store to see if it needs to reseed its search. As an upside, store.hpp has less stuff in it.

The dialogue manager also uses the actual ID rather than the keyword to compile its topic list.

The dialogue window and the journal were both switched to use the hypertext parser. They use the matches it returns to generate the text without the @# tags and, in the process, a more convenient token list (i.e. positions in the new text and entry link pointers). The display name generation is handled by the match struct itself, so the user doesn't need to differentiate between the types of links. The dialogue window now writes the text in one slightly more optimal way rather than two ways depending on the type of the links, while the journal no longer redoes implicit highlighting until the keyword search is invalidated. The journal also uses its own list of links rather than the keyword search function (it sucks, so it was removed).
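As an illustration of the display text generation described above, here is a minimal sketch that strips the @# markers and remaps link positions into the new text. The types and function names (`SourceMatch`, `DisplayLink`, `DisplayText`, `buildDisplayText`, `displayName`) are hypothetical, offsets are used instead of iterators for brevity, and the sketch stores the ID string where the real code would store an entry link pointer.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical types for illustration only.
struct SourceMatch
{
    std::size_t mBeg; // offset of '@' (explicit) or of the keyword (implicit)
    std::size_t mEnd; // offset past '#' (explicit) or past the keyword (implicit)
    bool mExplicit;
    std::string mValue; // resolved topic ID

    // Text shown to the player: the span minus the @ and # markers, if any.
    std::string_view displayName(std::string_view source) const
    {
        std::string_view span = source.substr(mBeg, mEnd - mBeg);
        return mExplicit ? span.substr(1, span.size() - 2) : span;
    }
};

struct DisplayLink
{
    std::size_t mBeg; // offsets into the display text
    std::size_t mEnd;
    std::string mValue;
};

struct DisplayText
{
    std::string mText;
    std::vector<DisplayLink> mLinks;
};

// Assumes matches are sorted by mBeg and do not overlap.
DisplayText buildDisplayText(std::string_view source, const std::vector<SourceMatch>& matches)
{
    DisplayText out;
    std::size_t pos = 0;
    for (const SourceMatch& match : matches)
    {
        out.mText.append(source.substr(pos, match.mBeg - pos)); // plain text before the link
        const std::size_t beg = out.mText.size();
        out.mText.append(match.displayName(source));
        out.mLinks.push_back(DisplayLink{ beg, out.mText.size(), match.mValue });
        pos = match.mEnd;
    }
    out.mText.append(source.substr(pos)); // plain text after the last link
    return out;
}
```

Since the match itself produces its display name, the caller runs the same loop for explicit and implicit links and never has to branch on the link type.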

And now for the most interesting part.

MRK files override the keyword seeded into the keyword search.

...

Ok, uh, that's basically it, actually. There's nothing particularly notable about that. Official localization relies on this to kill implicit highlighting for Russian topics and use it for English topics instead, to have some limited compatibility with mods that rely on the English dialogue.

Well, one notable thing is that it makes it possible for one keyword to be linked to multiple topics. This is evil. Don't do that. There's no good approach towards solving this, as the seeding is completely arbitrary. This was the source of a lot of confusion during research; it essentially means an existing topic MUST NOT be the second word in an MRK pair.
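The collision hazard can be illustrated with a small sketch. It does not model the MRK file format at all: `overrides` stands in for already-parsed pairs, assumed here to be (topic ID, replacement keyword), and both function names are hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: maps each keyword that would be seeded to every topic ID claiming it.
// `overrides` is assumed to hold (topic ID, replacement keyword) pairs.
std::map<std::string, std::vector<std::string>> collectKeywords(
    const std::vector<std::string>& topics, const std::map<std::string, std::string>& overrides)
{
    std::map<std::string, std::vector<std::string>> keywords;
    for (const std::string& topic : topics)
    {
        auto it = overrides.find(topic);
        // Without an override, the topic is assumed to be seeded under its own name.
        const std::string& keyword = it != overrides.end() ? it->second : topic;
        keywords[keyword].push_back(topic);
    }
    return keywords;
}

// Returns the keywords that would link to more than one topic.
std::vector<std::string> findAmbiguousKeywords(const std::map<std::string, std::vector<std::string>>& keywords)
{
    std::vector<std::string> result;
    for (const auto& [keyword, ids] : keywords)
        if (ids.size() > 1)
            result.push_back(keyword);
    return result;
}
```

If "topic a" is overridden to use the keyword "topic b" while "topic b" is itself an existing topic, both topics end up claiming the keyword "topic b", which is exactly the situation to avoid.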

Note: Cyrillic text is always case-sensitive.

Note: TESCS removes @# from topic responses (Akella used custom tools to translate the text), so no English content made in it should be affected by this, and it's probably not sensible to do something different if there are no localization files.

I'm not sure if the pseudoasterisk handling is totally accurate, but this should be fine for the official content.

There's also the part where explicit hyperlinks should have new lines replaced by spaces, but that raises too many questions, so I didn't bother with it here.

And of course there are tests. Unfortunately it's difficult to write tests for the dialogue window or the journal.

Addendum:

The journal/dialogue window still duplicate some code since they do the same thing (display text generation and matches within the display text). The duplicated fragments are short enough now that this shouldn't be a major issue.
Dialogue manager keyword search setup is sucky and assumes there'll only be one user. I really just copied the original design when moving it without particularly significant changes. Maybe the dialogue store should implement a listener kind of thing.
Doesn't address #8978 (closed).

Edited Feb 25, 2026 by Alexei Kotov
