Rewrite parsing
Changes
- Rewrite parsing
The result of this rewrite is faster and cleaner code; the parsing itself fits in about 400 lines, plus some structs to hold the different operators and result types.
Design choices:
- The base of all parsing is a `StreamReader`, which in particular has a `yield_tokens` function that handles nearly all low-level text processing (dealing with the leading `%_` prefix, string decoding and collecting); see the sketch after this list.
- Some containers, such as `%%BeginData` or `/Binary`, choose to skip the tokenizer altogether and operate on the raw stream.
- Files are parsed in one go; no backtracking is performed. This adds a little overhead for files that don't use any definitions (palette, styles, filters, text), but realistically, users are not going to import a lot of near-empty files.
- Very few assumptions are made about which object types are present in which container.
- Lean intermediate object structure; additional syntactic sugar will be added to them as cached properties (e.g. `Setup`).
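
To make the list above concrete, here is a minimal sketch of how these pieces could fit together. Only `StreamReader`, `yield_tokens`, and `Setup` are names from the actual change; the `%_` handling, the `read_line` helper, the `operators` property, and all other details are illustrative assumptions, not the real implementation.

```python
from dataclasses import dataclass, field
from functools import cached_property
from typing import Iterator, List, Optional


class StreamReader:
    """Sketch of the low-level reader (illustrative, not the real code)."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0

    def read_line(self) -> Optional[bytes]:
        # Hypothetical helper: return the next raw line, or None at EOF.
        if self.pos >= len(self.data):
            return None
        end = self.data.find(b"\n", self.pos)
        if end == -1:
            end = len(self.data)
        line, self.pos = self.data[self.pos:end], end + 1
        return line

    def yield_tokens(self) -> Iterator[str]:
        # Strip the leading %_ prefix, skip plain comment lines, and split
        # everything else into whitespace-separated tokens. String
        # decoding/collecting is elided here for brevity.
        while (line := self.read_line()) is not None:
            if line.startswith(b"%_"):
                line = line[2:]  # assumed: %_ marks a payload line
            elif line.startswith(b"%"):
                continue  # ordinary comment line
            yield from line.decode("latin-1").split()


@dataclass
class Setup:
    """Lean intermediate object; sugar is added lazily as cached properties."""

    tokens: List[str] = field(default_factory=list)

    @cached_property
    def operators(self) -> List[str]:
        # Hypothetical convenience view, computed once on first access.
        return [t for t in self.tokens if t.startswith("/")]


reader = StreamReader(b"%_/Mfsetup 0 1\n% a comment\n72 72 translate\n")
print(list(reader.yield_tokens()))
# -> ['/Mfsetup', '0', '1', '72', '72', 'translate']
```

The real tokenizer presumably does more (string collection, escapes), but this captures the shape: one generator feeding the container parsers, with lazy sugar layered on top of lean intermediate objects.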
By far the slowest part of the parsing is the tokenizer, but the performance seems acceptable. If we want to speed it up, the tokenizer is small and isolated and can easily be converted to a C extension (but I don't want to deal with the hassle of shipping platform-dependent wheels at this stage of the project).
As a result, all current files are parsed to the end. I haven't yet validated that everything ends up inside the correct container, but it generally looks correct, and the parsing is simple enough that small issues are easy to fix; I verified as much while getting all the progress tests to pass.
Some of the old unit tests are already restored; the call structure in the other tests will be restored in a separate commit. I've preserved the old parser and its tests in a `_`-prefixed subdirectory.
Checklist
- Change is tested
- Change is documented
- Licenses are up to date