Add a WordContextDataset class and optimize the code to generate a dataset

Jonas Östlund requested to merge optimized-dataset into master

This merge request adds the following:

  • A WordContextDataset class with a Clojure interface in the jobtech-nlp.word-context-dataset namespace.
  • An optimized implementation for building a WordContextDataset.

A WordContextDataset is backed by a memory-mapped file, so it can be very large while still providing random access. Its interface lets it be used like other Clojure collections.
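
To illustrate the general idea, here is a minimal JVM-level sketch in Java of a read-only collection backed by a memory-mapped file. It is not the code in this merge request: the class name MappedIntRecords, the fixed-width record layout and the write helper are all assumptions made for the example, and the actual WordContextDataset file format is not described here. Extending AbstractList is used as a rough Java analogue of exposing the data through a standard collection interface.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.AbstractList;

// Hypothetical example class: a list of fixed-width int records stored in a
// memory-mapped file. The layout is an assumption, not the real dataset format.
final class MappedIntRecords extends AbstractList<int[]> {
    private final IntBuffer ints;   // int view over the mapped file
    private final int recordSize;   // number of ints per record

    private MappedIntRecords(IntBuffer ints, int recordSize) {
        this.ints = ints;
        this.recordSize = recordSize;
    }

    // Writes all records back to back as big-endian ints (ByteBuffer's default order).
    static void write(Path path, int[][] records) throws IOException {
        int total = 0;
        for (int[] record : records) {
            total += record.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(total * Integer.BYTES);
        for (int[] record : records) {
            for (int value : record) {
                buf.putInt(value);
            }
        }
        Files.write(path, buf.array());
    }

    // Maps the whole file read-only. The mapping remains valid after the channel
    // is closed. A single mapping is limited to Integer.MAX_VALUE bytes, so a
    // larger file would need several mappings (not shown here).
    static MappedIntRecords open(Path path, int recordSize) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            IntBuffer ints = channel
                .map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                .asIntBuffer();
            return new MappedIntRecords(ints, recordSize);
        }
    }

    // Random access: only the part of the file holding this record is read.
    @Override
    public int[] get(int index) {
        if (index < 0 || index >= size()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int[] record = new int[recordSize];
        for (int i = 0; i < recordSize; i++) {
            record[i] = ints.get(index * recordSize + i);
        }
        return record;
    }

    @Override
    public int size() {
        return ints.capacity() / recordSize;
    }
}
```

With such a class, a record can be fetched by index without reading the rest of the file, and the object can be passed to any code that expects a java.util.List, including iteration with a for-each loop.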

For comparison, the new implementation builds a dataset in about 20 seconds from the same data that took 10 minutes before the optimization. To run the new implementation, call it with something like

clj -M:run-m context-opt ~/issues/tokenizer/data0/ tokenized-text.txt

To run the old implementation, call it with something like

clj -M:run-m context ~/issues/tokenizer/data0/ tokenized-text.txt