
packrat parser

finesse importer requested to merge packrat into master

Fixes #113 (closed), #182 (closed), #240 (closed), #251 (closed)

Implements new katscript v3 grammar using a new "packrat" parser. See last comments in #240 (closed) for details.

  • Finish model compilation
    • Tokeniser
    • Parser
    • Builder
      • Graph filler
      • Resolver
      • Compiler
        • Handling of self-referencing parameters
        • Tests for self-referencing parameters
  • Finish generator
    • Graph -> tokens
    • Model -> graph
  • How do we want to handle symbols in info parameters? (See below)
  • Update syntax documentation
  • Developer documentation: overview of how the parser works, how to add new components, etc.
  • Finish documentation syntax generator (fix finesse_sphinx, could use a different generator)
  • Restore disabled unparse examples in docs/source/getting_started/simple_example.rst and docs/source/usage/updating_params.rst
  • Show arguments that cause errors in __init__ (see #246 (closed))
  • CLI: add parser debug CLI commands
  • Fuzzing tests: generate lines of valid script and check they parse/build

Differences in behaviour from Python API

As per the behaviour decided in the telecon that discussed !40 (closed) (noted there), specifying R=x.T in kat script (i.e. a reference to another parameter without the &) copies x.T's current value into R during parsing and passes that value (now a float or whatever) into the Python constructor. When using the Python API directly, however, specifying R=x.T should throw an error telling the user to specify either .value or .ref.
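
A rough sketch of the intended difference (the script lines, element names and error handling here are illustrative, not taken from the implementation):

# kat script:
m x R=0.9 T=0.1
m m1 R=x.T      # no '&': x.T's current value (0.1) is copied into m1.R at parse time
m m2 R=&x.T     # with '&': m2.R stays a live reference to x.T

# Python API: passing the parameter object itself is ambiguous and should raise,
# pointing the user at one of:
#   Mirror("m1", R=x.T.value)  # copy the current value
#   Mirror("m1", R=x.T.ref)    # keep a symbolic reference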

Changes to core

Model.run defaults to Noxaxis if not specified

Previously the parser set the analysis to Noxaxis() if it wasn't specified in the kat script, but of course this did nothing for Python API users. I figured it made more sense for Model.run to handle this default, so I moved the behaviour there.
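
A minimal sketch of the intended behaviour (the script line is illustrative):

import finesse

model = finesse.Model()
model.parse("m m1 R=0.9 T=0.1")  # no analysis specified in the script

# The parser no longer injects Noxaxis(); Model.run applies the default itself,
# so Python API users get the same behaviour as kat script users:
solution = model.run()  # equivalent to running Noxaxis when no analysis is given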

Dependencies

Some checking tools like pyflakes had to be updated because I used some walrus operators. Development environments will need to be updated when this gets merged.

Open questions

Symbols in info parameters

In the example below, modulator's order parameter is an expression involving a symbol. However, order is not a model parameter, just an info parameter, so this will (probably) fail at the simulation stage. My opinion is that the parser should allow this, since it's still possible to do via the Python API (see the sketch after the example below). We should then probably catch it earlier than the simulation stage, such as in the validation stage I discuss in #246 (closed).

var order 2
modulator mod1 10M 0.1 1+&order.value
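
For comparison, roughly the equivalent via the Python API (the import path, constructor arguments and reference syntax here are assumptions for illustration):

from finesse import Model
from finesse.components import Modulator  # import path is an assumption

model = Model()
model.parse("var order 2")
# order on the modulator is only an info parameter, but the Python API still
# accepts a symbolic expression here:
model.add(Modulator("mod1", f=10e6, midx=0.1, order=1 + model.order.value.ref))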

Registering custom elements

KatSpec is no longer a singleton. This means it's slightly more difficult to register custom elements etc. than it was before. However, we might consider a different approach altogether for custom components, such as specifying their Python path / file path in the config. This needs discussion.

Motivation notes

Why use a packrat parser?

  • Infinite lookahead at the expense of more memory, but memory is cheap now. An entire program in memory is tiny compared to a photo taken with your smartphone...
  • One function per grammar rule; recursive, stack-based descent, in contrast to LALR etc., which build a parsing table (push-down automaton). See the sketch after this list.
    • Lets you easily parse e.g. function calls and statements, which differ only in what's inside the (): LALR would only look at the ( and not know what to do.
    • Useful features not found in LALR etc.: lookahead assertions (as in regexes), which let you deal with context-dependent grammars.
    • Conceptually simpler to debug. Shift/reduce conflicts in LALR can be pretty hard to figure out.
    • Allows left recursion in grammar rules, like key_value_list <- key_value_list ',' key_value_list. In LALR it would refuse to build the parser.
    • Syntax errors can be detected by implementing rules alongside the real grammar rules (e.g. a rule that detects positional args following kwargs), rather than having to work them out in the error handler.
  • Blog posts/videos from Guido van Rossum:
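
As a rough illustration of the "one function per grammar rule, memoised by position" idea (a generic sketch, not the Finesse implementation):

from functools import lru_cache

class PackratParser:
    """Toy packrat parser: each grammar rule is a method, and results are
    memoised per (rule, position), giving linear time despite unlimited
    backtracking and lookahead."""

    def __init__(self, tokens):
        self.tokens = tokens  # e.g. (type, value) pairs from the tokeniser

    def expect(self, pos, kind):
        # Return (value, next_pos) if the token at pos has the given type, else None.
        if pos < len(self.tokens) and self.tokens[pos][0] == kind:
            return self.tokens[pos][1], pos + 1
        return None

    @lru_cache(maxsize=None)
    def key_value(self, pos):
        # key_value <- NAME '=' NAME
        if (key := self.expect(pos, "NAME")) is None:
            return None
        if (eq := self.expect(key[1], "EQUALS")) is None:
            return None
        if (value := self.expect(eq[1], "NAME")) is None:
            return None
        return {key[0]: value[0]}, value[1]

    @lru_cache(maxsize=None)
    def key_value_list(self, pos):
        # key_value_list <- key_value (',' key_value)*
        if (first := self.key_value(pos)) is None:
            return None
        items, pos = dict(first[0]), first[1]
        while (comma := self.expect(pos, "COMMA")) is not None:
            if (nxt := self.key_value(comma[1])) is None:
                break
            items.update(nxt[0])
            pos = nxt[1]
        return items, pos

tokens = [("NAME", "R"), ("EQUALS", "="), ("NAME", "0.9"),
          ("COMMA", ","), ("NAME", "T"), ("EQUALS", "="), ("NAME", "0.1")]
print(PackratParser(tokens).key_value_list(0))  # ({'R': '0.9', 'T': '0.1'}, 7)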

Why use a new tokeniser?

  • Need to keep raw token values around for re-generation (see the sketch below). SLY and friends usually throw this information away at some point, e.g. by converting a quoted "string" into the bare string, or the string "1" into the integer 1.
    • It's also needed to display token values correctly in errors; tabs need to be assigned a fixed number of spaces.
  • Using the same container objects for tokens as for parser productions makes error messages easier to build.
  • Lots of "magic" happens in SLY's tokeniser that is hard to understand; the new tokeniser is conceptually simpler.
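
For illustration only (the class and field names here are assumptions, not the actual Finesse classes), the idea of keeping the raw text alongside the interpreted value:

from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    type: str      # e.g. "NUMBER", "NAME", "STRING"
    raw: str       # exact text from the script, kept for re-generation
    value: object  # interpreted value, e.g. "10M" -> 1e7
    lineno: int
    column: int

# The raw text lets the generator reproduce "10M" rather than "10000000.0",
# and lets error messages show exactly what the user typed (including tabs
# rendered as a fixed number of spaces).
tok = Token("NUMBER", "10M", 1e7, lineno=2, column=15)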