chunker: use tokenizer to determine chunk size limits
The following discussion from !13 (merged) should be addressed:
- [ ] @jshobrook1 started a discussion: (+2 comments) What embedding model are we planning to use? Any chance we can use the tokenizer to ensure chunks don't exceed a token limit instead of a character limit? It would be safer to guarantee that a chunk will never exceed a token limit.
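
One possible approach, as a minimal sketch only: since the embedding model isn't decided yet, this assumes a Hugging Face tokenizer, and the model name, `chunk_by_tokens` helper, and `max_tokens` parameter are illustrative placeholders rather than the chunker's actual interface. The idea is to encode the text once, split the token IDs into windows of at most `max_tokens`, and decode each window back to text, so no chunk can exceed the token limit regardless of character length.

```python
# Hypothetical sketch: token-based chunking with a Hugging Face tokenizer.
# The model name and helper name are assumptions, not the project's actual API.
from transformers import AutoTokenizer


def chunk_by_tokens(
    text: str,
    max_tokens: int = 512,
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
) -> list[str]:
    """Split `text` into chunks whose token count never exceeds `max_tokens`."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Encode once without special tokens so the count reflects raw content only.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for start in range(0, len(token_ids), max_tokens):
        window = token_ids[start:start + max_tokens]
        # Decode each window back to text; each chunk is guaranteed to fit the limit.
        chunks.append(tokenizer.decode(window))
    return chunks
```

Note that slicing on raw token boundaries can cut mid-word or mid-sentence; in practice we would probably still split on sentence or paragraph boundaries and only use the tokenizer to count tokens per candidate chunk, but the hard token cap above is what guarantees the limit is never exceeded.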