
Rewrite parsing

Jonathan Neuhauser requested to merge rewrite-parser into main

The result of this rewrite is faster and cleaner code; the parsing itself fits in about 400 lines, plus some structs to hold the different operators and result types.

Design choices:

  • The base of all parsing is a StreamReader, whose yield_tokens function in particular handles nearly all low-level text processing (stripping leading %_, decoding and collecting strings); see the sketch after this list.
  • Some containers, such as %%BeginData or /Binary, skip the tokenizer altogether and operate on the raw stream.
  • Files are parsed in one go; no backtracking is performed. This adds a little overhead for files that don't use any definitions (palette, styles, filters, text), but realistically, users are not going to import a lot of near-empty files.
  • Very few assumptions are made about which object types are present in which container.
  • The intermediate object structure is lean; additional syntactic sugar will be added as cached properties (e.g. on Setup, as in the second sketch below).
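
For illustration, here is a minimal sketch of what this design could look like. Only StreamReader, yield_tokens and the %%BeginData//Binary bypass come from the description above; every other name and the token grammar are hypothetical assumptions, not the actual implementation.

```python
import re
from typing import BinaryIO, Iterator

# Guessed PostScript-style token grammar: (strings), /names, bare
# operators/numbers, and single-character delimiters.
TOKEN_RE = re.compile(rb"\(.*?\)|/?[^\s()<>\[\]{}/%]+|[()<>\[\]{}]")


class StreamReader:
    """Sketch: wraps the input and hides the low-level text handling."""

    def __init__(self, stream: BinaryIO):
        self.stream = stream

    def yield_tokens(self) -> Iterator[bytes]:
        for raw_line in self.stream:
            line = raw_line.rstrip(b"\r\n")
            if line.startswith(b"%_"):
                # Private AI data wrapped in EPS: strip the marker, then
                # tokenize the rest of the line as usual.
                line = line[2:]
            elif line.startswith(b"%"):
                # DSC comments such as %%BeginData are yielded whole so a
                # container handler can react to them.
                yield line
                continue
            yield from TOKEN_RE.findall(line)

    def read_raw(self, length: int) -> bytes:
        # Containers like %%BeginData or /Binary bypass the tokenizer
        # entirely and read their payload from the underlying stream.
        return self.stream.read(length)
```

A container handler that sees a %%BeginData comment would extract the byte count from it and call read_raw instead of tokenizing the payload.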

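And a sketch of the cached-property sugar on the lean intermediate objects; Setup appears above, but the field and the palette property are made-up examples:

```python
from dataclasses import dataclass, field
from functools import cached_property


@dataclass
class Setup:
    # Lean storage: just the children collected while parsing the container.
    children: list = field(default_factory=list)

    @cached_property
    def palette(self):
        # Hypothetical sugar: computed on first access and then cached, so
        # files that never use a palette pay nothing for it.
        return [c for c in self.children if getattr(c, "is_palette", False)]
```
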
By far the slowest part of the parsing is the tokenizer, but the performance seems acceptable. If we want to speed it up, the tokenizer is small and isolated and can easily be converted to a C extension (but I don't want to deal with the hassle of shipping platform-dependent wheels at this stage of the project).

As a result, all current files are parsed to the end. I haven't yet validated that everything ends up inside the correct container, but it generally looks correct, and the parsing is simple enough that small issues are easy to fix; I tested that while getting all the progress tests to pass.

Some of the old unit tests are already restored; the call structure in the other tests will be restored in a separate commit. I've preserved the old parser and its tests in a _-prefixed subdirectory.

Checklist

  • Change is tested
  • Change is documented
  • Licenses are up to date
