Add a WordContextDataset class and optimize the code to generate a dataset
This code adds the following contributions:
- A WordContextDataset class with a Clojure interface in the jobtech-nlp.word-context-dataset namespace.
- An optimized implementation for building a WordContextDataset.
A WordContextDataset is backed by a memory-mapped file, so it can be huge while still providing random access. Its interface makes it possible to treat it like other Clojure collections.
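As a rough illustration of the idea, and not the actual WordContextDataset code, the sketch below memory-maps a file of fixed-width integers and exposes it through count, nth and seq. The names write-ints! and mapped-int-dataset and the 4-byte record layout are assumptions made for this example only; the real on-disk format is not described here.

```clojure
(import '(java.io File RandomAccessFile)
        '(java.nio ByteBuffer)
        '(java.nio.channels FileChannel$MapMode))

(defn write-ints!
  "Hypothetical helper: writes xs to file as consecutive 4-byte big-endian ints."
  [^File file xs]
  (with-open [raf (RandomAccessFile. file "rw")]
    (doseq [x xs]
      (.writeInt raf (int x)))))

(defn mapped-int-dataset
  "Hypothetical sketch: memory-maps file read-only and returns an object
  that supports count, nth and seq. Assumes the file is smaller than 2 GiB,
  the limit of a single mapping."
  [^File file]
  (with-open [raf (RandomAccessFile. file "r")]
    (let [channel (.getChannel raf)
          size (.size channel)
          ;; The mapping remains valid after the channel is closed.
          ^ByteBuffer buf (.map channel FileChannel$MapMode/READ_ONLY 0 size)
          n (int (quot size 4))]
      (reify
        clojure.lang.Counted
        (count [_] n)

        clojure.lang.Indexed
        ;; Random access: read the 4 bytes at byte offset 4*i.
        (nth [_ i] (.getInt buf (int (* 4 i))))
        (nth [_ i not-found]
          (if (and (<= 0 i) (< i n))
            (.getInt buf (int (* 4 i)))
            not-found))

        clojure.lang.Seqable
        ;; Lazy sequence over all records; nil when the file is empty.
        (seq [this]
          (seq (map (fn [i] (nth this i)) (range n))))))))
```

With a file written by write-ints!, the result of mapped-int-dataset can be passed to count, nth and seq like a vector, while the data itself stays in the mapped file rather than on the JVM heap.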
For comparison, the new implementation takes about 20 seconds to build a dataset from data that took the old implementation 10 minutes. To run the new implementation, call it with something like
clj -M:run-m context-opt ~/issues/tokenizer/data0/ tokenized-text.txt
To run the old implementation, call it with something like
clj -M:run-m context ~/issues/tokenizer/data0/ tokenized-text.txt