
Draft: More direct language handling

Michal Vlasák requested to merge mvlasak/minim:languages into master

Previous TeX engines could load hyphenation patterns only at format creation time, that is, when running iniTeX. LuaTeX has no such limitation: with Lua it is possible to load hyphenation patterns at any time.

Today virtually all hyphenation patterns and exceptions that have been used by TeX users are distributed in the hyph-utf8 package. hyph-utf8 also provides the patterns and exceptions as newer UTF-8 encoded plain text files, which are the preferred form for LuaTeX.

TeX Live's approach is to provide hyphenation patterns/exceptions for each language in a separate package. Each package then hooks itself in using the TeX Live execute AddHyphen directive. An example for French:

execute AddHyphen \
    name=french synonyms=patois,francais \
    lefthyphenmin=2 righthyphenmin=2 \
    file=loadhyph-fr.tex \
    file_patterns=hyph-fr.pat.txt \
    file_exceptions=

This information is also written to the files used by the eTeX language mechanism, which plain LuaTeX uses. The following is added to language.def:

\addlanguage{french}{loadhyph-fr.tex}{}{2}{2}

and this is written to language.dat.lua:

['french'] = {
    loader = 'loadhyph-fr.tex',
    lefthyphenmin = 2,
    righthyphenmin = 2,
    synonyms = { 'patois', 'francais' },
    patterns = 'hyph-fr.pat.txt',
    hyphenation = '',
},
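
Looking up a language by name or synonym in such a table could be sketched roughly as follows. The table is a stand-in containing only the French entry shown above, and find_language is a hypothetical helper name, not a function from luatex-hyphen.lua.

```lua
-- Stand-in for the table returned by language.dat.lua (French entry only).
local language_dat = {
    ['french'] = {
        loader = 'loadhyph-fr.tex',
        lefthyphenmin = 2,
        righthyphenmin = 2,
        synonyms = { 'patois', 'francais' },
        patterns = 'hyph-fr.pat.txt',
        hyphenation = '',
    },
}

-- Hypothetical helper: return the canonical name and entry for a given
-- language name or synonym, or nil if nothing matches.
local function find_language(db, name)
    if db[name] then
        return name, db[name]
    end
    for canonical, entry in pairs(db) do
        for _, synonym in ipairs(entry.synonyms or {}) do
            if synonym == name then
                return canonical, entry
            end
        end
    end
    return nil
end

local canonical, entry = find_language(language_dat, 'patois')
print(canonical, entry.patterns)
```

Resolving synonyms on the Lua side like this means the hyphenmin values and file names come from a single entry, whichever name the user typed.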

etex.src reads language.def at format creation time. Listed languages are registered and their hyphenation patterns loaded into the format. This enables their use later with \uselanguage.

For LuaTeX, loading patterns into the format is even discouraged, so hyph-utf8 ships its own etex.src that changes the mechanism. Instead of loading each pattern or exception file on \addlanguage, the language is only registered, and the files are loaded on the first \uselanguage. Both commands actually use Lua code in luatex-hyphen.lua, which gets its information from the language.dat.lua database.
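
The register-now, load-on-first-use idea can be illustrated with a small plain Lua sketch. All names here (add_language, use_language, load_files) are hypothetical, the counter stands in for the engine's language allocation, and load_files only counts calls instead of reading pattern files.

```lua
-- Hypothetical registry: language name -> { id, entry, loaded }.
local registered = {}
local next_id = 0

-- Stand-in for reading the pattern/exception files and handing them to
-- the engine; here it only counts how often it was called.
local loads = 0
local function load_files(entry)
    loads = loads + 1
end

-- Registration only records the entry; no file is read yet.
local function add_language(name, entry)
    next_id = next_id + 1
    registered[name] = { id = next_id, entry = entry, loaded = false }
    return next_id
end

-- The files are loaded the first time the language is actually used.
local function use_language(name)
    local l = assert(registered[name], 'unknown language: ' .. name)
    if not l.loaded then
        load_files(l.entry)
        l.loaded = true
    end
    return l.id
end

add_language('french', { patterns = 'hyph-fr.pat.txt', hyphenation = '' })
local first = use_language('french')
local second = use_language('french')
print(first, second, loads)
```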

But why use the language.def file at all? Its situation with synonyms isn't all that great, and information about left/right hyphenmins is already present in language.dat.lua.

The approach in this pull request separates language handling from minim-etex and introduces minim-languages (which depends on callbacks and alloc). Most of the work happens in Lua, where \newlanguage and \minim:uselanguage are defined. Both use LuaTeX's lang.new() function to allocate language numbers rather than the classical TeX count register 19. \addlanguage no longer exists.

The real \uselanguage is defined in TeX and keeps \uselanguage@hook, which minim actually used before. This pull request changes that and instead defines a custom callback, which allows anybody to hook into \uselanguage. For compatibility, the TeX hook is also kept. \minim:uselanguage could in fact be omitted, with its Lua function simply registered into the callback instead.
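
As a rough illustration of hooking into \uselanguage from Lua, here is a minimal list-of-functions sketch. It deliberately does not use minim's callbacks module; register_uselanguage_hook and run_uselanguage_hooks are hypothetical names chosen for this example.

```lua
-- Hypothetical minimal hook list, standing in for a real callback mechanism.
local hooks = {}

-- Anybody can append a function to be called on every language switch.
local function register_uselanguage_hook(fn)
    hooks[#hooks + 1] = fn
end

-- Would be called by the \uselanguage implementation with the language name.
local function run_uselanguage_hooks(name)
    for _, fn in ipairs(hooks) do
        fn(name)
    end
end

-- Example hook: record every language that was switched to.
local seen = {}
register_uselanguage_hook(function(name) seen[#seen + 1] = name end)
run_uselanguage_hooks('french')
print(#seen, seen[1])
```

A plain list like this is about the simplest alternative to a full callback module, at the cost of having no way to remove or order hooks.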

Part of the code is taken almost literally from luatex-hyphen.lua (CC0 license). The code I would have written would essentially be the same. Polyglossia also has its own (very similar) code to parse language.dat.lua. Babel doesn't use language.def and language.dat.lua at all.

For discussion:

  • Usability outside of minim? Why pull in callbacks? Simple callback function instead, like with \uselanguage@hook?
  • Register M.use_language to the callback instead of having \minim:uselanguage?
  • Theoretically (https://texdoc.org/serve/luatex-hyphen/0) the language.dat.lua format doesn't include all the needed information. In practice (TeX Live, MiKTeX) it does. Perhaps the specification should be updated to make the fields we use mandatory, just to be sure.
  • What about these minim-pdf definitions that expect the usual \addlanguage/\uselanguage two step process:
% \newnamedlanguage {name} {lhm} {rhm}
\def\newnamedlanguage#1#2#3{%
    \expandafter\newlanguage\csname lang@#1\endcsname
    \expandafter\chardef\csname lhm@#1\endcsname=#2\relax
    \expandafter\chardef\csname rhm@#1\endcsname=#3\relax
    \csname lu@texhyphen@loaded@\the\csname lang@#1\endcsname\endcsname}

% \newnameddialect {language} {dialect}
\def\newnameddialect#1#2{%
    \expandafter\chardef\csname lang@#2\endcsname\csname lang@#1\endcsname
    \expandafter\chardef\csname lhm@#2\endcsname\csname lhm@#1\endcsname
    \expandafter\chardef\csname rhm@#2\endcsname\csname rhm@#1\endcsname}
