Add a WordContextDataset class and optimize the code to generate a dataset (!2) · Merge requests · Arbetsförmedlingen / taxonomy-dev / backend / nlp / simple-tokenizer

Jonas Östlund requested to merge optimized-dataset into master Jul 05, 2021

This merge request contributes the following:

A WordContextDataset class with a Clojure interface in the jobtech-nlp.word-context-dataset namespace.
An optimized implementation for building a WordContextDataset.

A WordContextDataset is backed by a memory-mapped file, so it can be huge while still providing random access. Its interface makes it possible to treat it like other Clojure collections.
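To illustrate the general idea of a memory-mapped collection, here is a minimal sketch. It is not the WordContextDataset implementation: the names write-ints! and mmap-int-vector are hypothetical, and the record format (plain 4-byte integers) is an assumption chosen for brevity. It only shows how a memory-mapped file can be exposed through the standard Clojure collection interfaces.

```clojure
(import '(java.io File RandomAccessFile)
        '(java.nio.channels FileChannel$MapMode))

;; Hypothetical helper: writes each value in xs as a 4-byte big-endian int.
(defn write-ints!
  [^File f xs]
  (with-open [raf (RandomAccessFile. f "rw")]
    (doseq [x xs]
      (.writeInt raf (int x)))))

;; Hypothetical helper: memory-maps f read-only and returns an object that
;; supports count, nth and seq. The mapping stays valid after the channel
;; is closed. A single mapping is limited to Integer/MAX_VALUE bytes, so a
;; truly huge file would need several mappings (not shown here).
(defn mmap-int-vector
  [^File f]
  (with-open [raf (RandomAccessFile. f "r")]
    (let [ch  (.getChannel raf)
          buf (.asIntBuffer (.map ch FileChannel$MapMode/READ_ONLY 0 (.size ch)))
          n   (.limit buf)]
      (reify
        clojure.lang.Counted
        (count [_] n)
        clojure.lang.Indexed
        ;; Random access: read the i-th int straight from the mapped buffer.
        (nth [_ i] (.get buf (int i)))
        (nth [_ i not-found]
          (if (and (<= 0 i) (< i n))
            (.get buf (int i))
            not-found))
        clojure.lang.Seqable
        ;; Lazy sequential view, so map, filter and reduce work as usual.
        (seq [_] (seq (map (fn [i] (.get buf (int i))) (range n))))))))
```

With these definitions, the result of (mmap-int-vector f) can be passed to count, nth, map, and reduce like an ordinary collection, while the data itself stays in the file rather than on the JVM heap.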

For comparison, the new implementation builds a dataset in about 20 seconds from the same data that took 10 minutes before the optimization. To run the new implementation, call it with something like

clj -M:run-m context-opt ~/issues/tokenizer/data0/ tokenized-text.txt

To run the old implementation, call it with something like

clj -M:run-m context ~/issues/tokenizer/data0/ tokenized-text.txt
