
Fix expression parsing


Fixes #127 (closed), #136 (closed), #137 (closed).

This is a rewrite of the lexer and parser to tokenise whitespace where appropriate. This seems to be the only reliable way to parse expressions like (1+1)-1, for reasons explained below. One critical change is that parameter lists delimited by ( and ) must now span multiple lines; they are no longer allowed on a single line (see below).

Problem

The current parser on the master branch tokenises (1+1)-1 as (, 1, +, 1, ), -1; that is, it includes the sign of a negative number in the number token rather than emitting it as a separate operator token (which is required for unary minus operator precedence). Because the parser expects expressions to follow the form lhs OPERATOR rhs, this causes a syntax error.

My first attempted fix was to stop matching the sign as part of the number and instead force it to be tokenised as an operator, so that the expression would be tokenised as (, 1, +, 1, ), -, 1, i.e. with the bracketed expression and the final 1 separated by the - operator. This however created a knock-on problem: parameter lists became ambiguous again - something we tried to fix in !21 (merged). Because whitespace has until now been ignored by the parser (although not by the tokeniser), the parser would have no way to disambiguate the expression (1+1)-1 from the parameter list 1 +1 -1 in e.g. laser l1 1 +1 -1.
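
To make the ambiguity concrete, here is a toy regex tokeniser. This is a standalone sketch, not the real tokeniser (whose rules are more involved):

import re

# Signed NUMBER, as on master: the sign is part of the number token.
SIGNED = re.compile(r"-?\d+|[()+\-*/]")
# Unsigned NUMBER, as in the first attempted fix: '-' is always an operator.
UNSIGNED = re.compile(r"\d+|[()+\-*/]")

print(SIGNED.findall("(1+1)-1"))    # ['(', '1', '+', '1', ')', '-1'] -> syntax error
print(UNSIGNED.findall("(1+1)-1"))  # ['(', '1', '+', '1', ')', '-', '1'] -> parses
print(UNSIGNED.findall("1 +1 -1"))  # ['1', '+', '1', '-', '1'] -> indistinguishable
                                    # from the expression 1+1-1 once whitespace is gone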

Distilled to a single sentence, the problem with the kat script grammar is that whitespace is sometimes a delimiter and sometimes not. That is the root of all of these parsing issues.

Solution

The fix I found was to stop ignoring whitespace in most cases in the tokeniser, and to add WHITESPACE tokens to the parser rules only where whitespace acts as a delimiter. The parser rules can now state explicitly where they expect whitespace in their productions, rather than relying on the lexer to transparently enforce it in each of the various situations (e.g. in parameter lists, where whitespace is the delimiter, or in arrays and functions, where commas, not whitespace, are the delimiter).

In cases where whitespace is merely whitespace, not a delimiter, it is discarded and the parser never sees it. This keeps the grammar unambiguous while still permitting whitespace everywhere the user may reasonably write it.
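
As a rough sketch of the idea in SLY (illustrative token names only, not the actual Finesse tokeniser):

from sly import Lexer

class KatLexerSketch(Lexer):
    # Hypothetical subset of the token set; the real tokeniser defines many more.
    tokens = { NUMBER, NAME, WHITESPACE, PLUS, MINUS, LPAREN, RPAREN }

    # Whitespace is emitted as a real token instead of being ignored, so the
    # parser can require it explicitly in the productions that need it.
    WHITESPACE = r'[ \t]+'
    NUMBER = r'\d+'
    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    PLUS = r'\+'
    MINUS = r'-'
    LPAREN = r'\('
    RPAREN = r'\)'

The parser then names WHITESPACE only in the productions where it acts as a delimiter (such as the params rule shown below), while the normalisation layer described under "Tokeniser hack" discards it everywhere else.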

In hindsight this approach seems a more natural way to do it, though due to the limitations of the parsing algorithm I had to add a hack (see below).

Tokeniser hack

The parser only looks at the current and next tokens in the stream to determine which rule to apply. This limitation makes this sort of rule ambiguous:

@_(
    "value",
    "key_value_dict",
    "params WHITESPACE value",
    "params WHITESPACE key_value_dict"
)
def params(self, p):
    ...  # body elided in this excerpt

Because a value can be a params, if an argument list of the form ( 1 2 ) is encountered, the parser parses 1 -> value -> params, then 2 -> value, and merges them via params WHITESPACE value -> params. At that point it sits at the . in ( 1 2 . ) and sees a WHITESPACE token next. Since it only looks at this one next token, and no further (where it would see a ) and know that this was the end of the parameter list), it assumes yet another value is coming and attempts to apply the params WHITESPACE value rule (or the similar key_value_dict rule). Once it moves past the WHITESPACE token it sees ), which is neither a value nor a key_value_dict, so it throws a syntax error.
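
Schematically, with . marking the parser's position and a single token of lookahead (an illustrative trace, not a literal dump of the LALR states):

( 1 . WHITESPACE 2 WHITESPACE )      reduce 1 -> value -> params, shift WHITESPACE
( params WHITESPACE 2 . WHITESPACE ) reduce 2 -> value, then params WHITESPACE value -> params
( params . WHITESPACE )              lookahead is WHITESPACE: shift it, expecting another value
( params WHITESPACE . )              lookahead is ), which starts no value or key_value_dict: error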

To handle this, we could extend the rules for value and key_value_dict to take appropriate action when they encounter a ), but at that point things become even more complex, since we would start to duplicate the logic for handling ) across different productions.

I also tried changing the regular expressions that match (, [, ) and ] to "eat up" any whitespace adjacent to them. I hoped this would prevent a WHITESPACE token appearing before the ) in ( 1 2 ), avoiding the argument list problem above. This however didn't handle the case where a user puts comments inside multi-line parameter lists: the comments themselves are skipped by the tokeniser, but two WHITESPACE tokens are emitted, one on either side of the comment, and only one of them would get eaten up by the new regular expressions for ( or ). We could in theory have handled these consecutive whitespace tokens with a parser rule that merges them into one, but that would yet again have increased parser complexity (the parser debug file, which describes the state machine, was already 2000+ lines and around 100 states).

As an aside, this sort of ambiguity can probably be solved with a different parsing algorithm such as Earley, which, as far as I understand, back-tracks and tests the various branches of the parse tree to figure out which rule is best to apply, instead of using only the next token as a guide. Earley parsing is however apparently far slower than LALR (the algorithm we use), and our parsing library, SLY, does not support it, so I decided to first attempt a fix with what we currently use.

The hack I came up with was to add an intermediate layer between the tokeniser and parser which removes whitespace where appropriate. In our grammar, whitespace is only a delimiter when it separates parameters, or namespaces and fields. In other cases, such as before or after parentheses or brackets, it is to be ignored. To avoid having to deal with every such case in the parser, this intermediate layer removes whitespace wherever it is to be ignored. That leaves WHITESPACE tokens only where they genuinely act as delimiters, so only a few parser rules need to be adjusted.

The hack is implemented as a generator that takes the token stream (which is itself a generator) and yields each token unless it is non-delimiting whitespace. See normalise_whitespace in tokeniser.py for the implementation.
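
In outline, such a filter might look like this. This is a simplified sketch assuming token objects with a type attribute and hypothetical token-type names; the real normalise_whitespace handles more cases:

# Hypothetical token-type names for illustration.
OPENERS = {"LPAREN", "LBRACKET", "COMMA"}
CLOSERS = {"RPAREN", "RBRACKET", "COMMA"}

def normalise_whitespace_sketch(tokens):
    """Yield tokens, dropping WHITESPACE wherever it is not a delimiter."""
    prev = None
    pending = None  # most recent WHITESPACE token, not yet emitted
    for token in tokens:
        if token.type == "WHITESPACE":
            # Buffer it; consecutive WHITESPACE tokens (e.g. either side of
            # a skipped comment) collapse into a single pending token.
            pending = token
            continue
        if pending is not None:
            # Keep the whitespace only when it sits between two tokens that
            # need a delimiter, not when it merely pads a bracket or comma.
            if prev is not None and prev.type not in OPENERS \
                    and token.type not in CLOSERS:
                yield pending
            pending = None
        yield token
        prev = token

Note that the buffering also collapses the double WHITESPACE produced around skipped comments, the case that defeated the regular-expression approach above.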

Other fixes

A fix related to #137 (closed) was also made: expressions are now dumped with parentheses to ensure the intended precedence is preserved. This can produce verbose expressions in some cases, such as cos((2*pi)), but they still re-parse correctly. We can later add some intelligent de-bracketing behaviour to the generator.
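
For illustration, a fully parenthesising dump over a toy expression tree (not Finesse's actual generator) looks like this:

class BinaryOp:
    def __init__(self, op, lhs, rhs):
        self.op, self.lhs, self.rhs = op, lhs, rhs

class Function:
    def __init__(self, name, arg):
        self.name, self.arg = name, arg

def dump(node):
    """Parenthesise every binary operation so precedence survives re-parsing."""
    if isinstance(node, BinaryOp):
        return f"({dump(node.lhs)}{node.op}{dump(node.rhs)})"
    if isinstance(node, Function):
        return f"{node.name}({dump(node.arg)})"
    return str(node)

print(dump(Function("cos", BinaryOp("*", 2, "pi"))))  # cos((2*pi))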

Side effects

Due to the changes to the parser and tokeniser, some syntax errors previously caught by the tokeniser are now caught by the parser, and vice versa. This means some error messages have changed. We can work on making common user errors produce meaningful messages as we gain experience with the grammar.

As mentioned above, expressions are now dumped with parentheses they don't necessarily need. We can fix this later with some intelligent de-bracketing.

New tests

Because I changed the way numbers are tokenised, I added a set of tests checking how various forms of number get parsed, including some fuzzing tests.
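
A round-trip fuzz test over generated number literals might look something like this sketch (parse defaults to Python's float as a stand-in for the real parser entry point):

import random

def random_number_literal(rng):
    """Generate a number written in one of several forms."""
    kind = rng.choice(["int", "float", "exponent"])
    if kind == "int":
        return str(rng.randint(0, 10**6))
    if kind == "float":
        return f"{rng.uniform(0.0, 1000.0):.6f}"
    return f"{rng.randint(1, 9)}.{rng.randint(0, 999)}e{rng.randint(-30, 30)}"

def test_number_fuzzing(parse=float):
    rng = random.Random(42)  # fixed seed keeps failures reproducible
    for _ in range(1000):
        literal = random_number_literal(rng)
        assert parse(literal) == float(literal), literal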

To-do

  • Enforce a newline character after ( when it starts a multi-line parameter list.
  • Test that multi-line expressions are not allowed.
  • Test that multi-line and single-line parameter modes can't be combined.
  • Add tests for expressions with whitespace in and outside of parentheses.
  • Add more syntax tests for various combinations of whitespace.
  • Add more syntax tests for numbers with SI units.
  • Add more syntax tests for empty parameter lists, empty arrays, etc.
  • Add recursive fuzzing tests for arrays.
  • Remove temporary logging formatting, prints, etc.
  • Update the kat script grammar issue (#113 (closed)).