Draft: More direct language handling
Previous TeX engines could load hyphenation patterns only at format creation time, that is, when running iniTeX. LuaTeX has no such limitation: with Lua, hyphenation patterns can be loaded at any time.
Today virtually all hyphenation patterns and exceptions that have been used by
TeX users are distributed in the `hyph-utf8` package. `hyph-utf8` provides
patterns/exceptions also in the new UTF-8 encoded plain text files that are
preferred for LuaTeX.
TeX Live's approach is to provide hyphenation patterns/exceptions for each
language in a separate package. Each package then hooks itself in using the
TeX Live `execute AddHyphen` directive. An example for French:
execute AddHyphen \
name=french synonyms=patois,francais \
lefthyphenmin=2 righthyphenmin=2 \
file=loadhyph-fr.tex \
file_patterns=hyph-fr.pat.txt \
file_exceptions=
This information is also written to the files used by the eTeX language
mechanism, which plain LuaTeX uses as well. This gets added to `language.def`:
\addlanguage{french}{loadhyph-fr.tex}{}{2}{2}
and this is written to `language.dat.lua`:
['french'] = {
	loader = 'loadhyph-fr.tex',
	lefthyphenmin = 2,
	righthyphenmin = 2,
	synonyms = { 'patois', 'francais' },
	patterns = 'hyph-fr.pat.txt',
	hyphenation = '',
},
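To make the synonym handling concrete, here is a minimal sketch in plain Lua of how a record like the one above could be found by its name or by one of its synonyms. It assumes a table with the same shape as the French entry; `find_language` is a hypothetical helper written for illustration and is not taken from `luatex-hyphen.lua`.

```lua
-- Sample database in the shape of language.dat.lua (one record only).
local languages = {
  ['french'] = {
    loader = 'loadhyph-fr.tex',
    lefthyphenmin = 2,
    righthyphenmin = 2,
    synonyms = { 'patois', 'francais' },
    patterns = 'hyph-fr.pat.txt',
    hyphenation = '',
  },
}

-- Hypothetical helper: return the canonical name and the record for a
-- language name or one of its synonyms, or nil if nothing matches.
local function find_language(db, name)
  local record = db[name]
  if record then
    return name, record
  end
  for canonical, rec in pairs(db) do
    for _, synonym in ipairs(rec.synonyms or {}) do
      if synonym == name then
        return canonical, rec
      end
    end
  end
  return nil
end

local canonical, record = find_language(languages, 'patois')
print(canonical, record.patterns)
```

The linear scan over synonyms is fine for a database of this size; a real implementation might build a reverse index once instead.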
`etex.src` reads `language.def` at format creation time. Listed languages are
registered and their hyphenation patterns loaded into the format. This enables
their use later with `\uselanguage`.
Since loading patterns into the format is even discouraged for LuaTeX,
`hyph-utf8`'s own `etex.src` changes the mechanism. Instead of loading each
pattern or exception file on `\addlanguage`, the language is only registered,
and the files are loaded on the first `\uselanguage`. Both commands actually use
Lua code in `luatex-hyphen.lua`, which gets its information from the
`language.dat.lua` database.
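The "register first, load on first use" idea can be sketched in plain Lua as follows. This is only an illustration under stated assumptions: `add_language`, `use_language` and `load_patterns` are hypothetical names, and `load_patterns` is a stand-in for the real work of reading a pattern file and handing it to LuaTeX.

```lua
-- Stand-in loader: real code would read the pattern file and pass its
-- contents to LuaTeX. Here we only count how often it is called.
local load_count = 0
local function load_patterns(file)
  load_count = load_count + 1
  return 'patterns from ' .. file
end

local registered = {}

-- Registering only records where the patterns live; nothing is loaded yet.
local function add_language(name, file)
  registered[name] = { file = file, loaded = false }
end

-- The first use triggers loading; later uses reuse the stored result.
local function use_language(name)
  local entry = assert(registered[name], 'unknown language: ' .. name)
  if not entry.loaded then
    entry.patterns = load_patterns(entry.file)
    entry.loaded = true
  end
  return entry.patterns
end

add_language('french', 'hyph-fr.pat.txt')
use_language('french')
use_language('french')
print(load_count)  -- the loader ran only once despite two uses
```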
But why use the `language.def` file at all? Its handling of synonyms isn't all that great, and information about left/right hyphenmins is already present in `language.dat.lua`.
The approach in this pull request separates language handling from `minim-etex` and introduces `minim-languages` (which depends on `callbacks` and `alloc`). Most of the work happens in Lua, where `\newlanguage` and `\minim:uselanguage` are defined. Both use LuaTeX's `lang.new()` function to allocate language numbers rather than the classical TeX count register 19. There is no longer an `\addlanguage`.
The real `\uselanguage` is defined from TeX and keeps `\uselanguage@hook`, which minim actually used before. This pull request changes that and instead defines a custom callback, which allows anybody to hook into `\uselanguage`. For compatibility, the TeX hook is also kept. `\minim:uselanguage` could in fact be omitted, with its work registered into the callback instead.
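As a rough illustration of the callback idea, here is a plain Lua sketch of a hook list that several parties can register into. It does not use minim's actual `callbacks` module; `register_uselanguage` and `run_uselanguage` are hypothetical names chosen for this example.

```lua
-- Minimal hook list; a stand-in for a real callback mechanism.
local hooks = {}

-- Anybody can add a function to be run when a language is selected.
local function register_uselanguage(fn)
  hooks[#hooks + 1] = fn
end

-- Called whenever a language is selected; runs hooks in registration order.
local function run_uselanguage(name)
  for _, fn in ipairs(hooks) do
    fn(name)
  end
end

-- Two independent hooks: one standing in for the built-in loading step,
-- one standing in for user code.
local seen = {}
register_uselanguage(function(name) seen[#seen + 1] = 'load:' .. name end)
register_uselanguage(function(name) seen[#seen + 1] = 'user:' .. name end)

run_uselanguage('french')
print(table.concat(seen, ' '))
```

With this shape, the built-in loading step is just the first registered hook, which is what makes a separate `\minim:uselanguage` optional.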
Part of the code is taken almost literally from `luatex-hyphen.lua` (CC0 license). The code I would have written would essentially be the same. Polyglossia also has its own (very similar) code to parse `language.dat.lua`. Babel doesn't use `language.def` and `language.dat.lua` at all.
For discussion:
- Usability outside of minim? Why pull in callbacks? Simple callback function instead, like with `\uselanguage@hook`?
- Register `M.use_language` to the callback instead of having `\minim:uselanguage`?
- Theoretically (https://texdoc.org/serve/luatex-hyphen/0) the `language.dat.lua` format doesn't include all the needed information. In practice (TeX Live, MiKTeX) it does. Perhaps the specification should be updated to make the fields we use mandatory, just to be sure.
- What about these `minim-pdf` definitions that expect the usual `\addlanguage`/`\uselanguage` two-step process:
% \newnamedlanguage {name} {lhm} {rhm}
\def\newnamedlanguage#1#2#3{%
\expandafter\newlanguage\csname lang@#1\endcsname
\expandafter\chardef\csname lhm@#1\endcsname=#2\relax
\expandafter\chardef\csname rhm@#1\endcsname=#3\relax
\csname lu@texhyphen@loaded@\the\csname lang@#1\endcsname\endcsname}
% \newnameddialect {language} {dialect}
\def\newnameddialect#1#2{%
\expandafter\chardef\csname lang@#2\endcsname\csname lang@#1\endcsname
\expandafter\chardef\csname lhm@#2\endcsname\csname lhm@#1\endcsname
\expandafter\chardef\csname rhm@#2\endcsname\csname rhm@#1\endcsname}