Automatically generate prioritised lists of suggested new words for the SRS
There are far too many words in any language for an individual to learn them all, and some words are much more useful than others, so those should be learnt first. This is why standardised tests have vocab lists, and textbooks teach you certain words in a certain order.
Unfortunately, standardised tests and textbooks are made for abstract, idealised learners who live in the fantasy worlds of textbook authors. The "better" lists are based on frequency calculations over corpora of real language use. Alas, this is only slightly better than the fantasy worlds of the textbook authors, because actual, real learners, who are actual, real human beings, have interests and participate in social contexts that give them markedly different priorities. Someone who is a fan of classical music and regularly goes to concerts (and would like to read reviews, etc.) is going to encounter markedly different vocab to an online gamer who has never been to a classical music concert (and never intends to).
For example, take the Chinese word “后端” - "backside" or, more relevantly, "backend (computing)". This word is in neither HSK nor the frequency database, and is not even in the Azure dictionary. So it is "very uncommon". Or is it? For a technophile (programmer) such as myself who reads solidot.org every day, it is very common indeed - it appears in about 20% of the articles I read. Knowing this word is very important for me for understanding the articles I read on a daily basis.
One way this is dealt with is by using specialised word lists such as "optometry" or "skateboarding" in various forms, including in an SRS. This has value but is still far from optimal, as it relies on either someone else's reading preferences/habits, or some averaged set of priorities over a given subject area.
How do we do better then? By looking at what content learners are actually consuming, and using that. Word X is not known by the learner and is in Y% of texts read by the learner over the last Z days. That is of high value to the learner RIGHT NOW, so gets added high up the priority list.
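As a rough sketch of that scoring idea (not a proposed implementation), the snippet below ranks unknown words by the share of recently read texts they appear in. Everything named here is hypothetical: the `ReadText` record, the `suggest_words` function, and the `window_days` and `limit` parameters are invented for illustration, and it assumes each text has already been segmented into a set of distinct words.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ReadText:
    """One text the learner has read (hypothetical record shape)."""
    read_on: date
    words: set[str]  # distinct words in the text, assumed already segmented


def suggest_words(texts, known_words, today, window_days=30, limit=20):
    """Rank unknown words by the share of recently read texts containing them.

    Returns a list of (word, share) pairs, highest share first, where share
    is the Y% from the description above expressed as a fraction in [0, 1].
    """
    cutoff = today - timedelta(days=window_days)
    recent = [t for t in texts if cutoff <= t.read_on <= today]
    if not recent:
        return []

    # Count, for each unknown word, how many recent texts contain it.
    doc_counts = {}
    for text in recent:
        for word in text.words - known_words:
            doc_counts[word] = doc_counts.get(word, 0) + 1

    # Highest count first; ties are broken by the word itself so the
    # ordering is deterministic.
    ranked = sorted(doc_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count / len(recent)) for word, count in ranked[:limit]]


# Example: three texts inside the window, one well outside it.
today = date(2024, 1, 31)
texts = [
    ReadText(date(2024, 1, 30), {"后端", "服务器", "的"}),
    ReadText(date(2024, 1, 29), {"后端", "的"}),
    ReadText(date(2024, 1, 28), {"音乐会", "的"}),
    ReadText(date(2023, 6, 1), {"音乐会"}),
]
suggestions = suggest_words(texts, known_words={"的"}, today=today)
```

Here “后端” comes out on top because it is in two of the three recent texts, even though no general frequency list would rank it highly. Counting texts rather than raw occurrences is a deliberate choice in this sketch: it matches the "in Y% of texts" framing and stops one long article from dominating the list.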
- An interesting extension might be to attempt to determine whether particular categories of word (parts of speech, common spoken vs common written, etc.) are more or less important for overall comprehension of the texts, and would receive a further boost. One example might be grammatical words not being important and nouns being more important (or the opposite!).
This should then interact intelligently with #14