This is a rough plan for developing a Malayalam spellchecker on top of mlmorph.
Author: Santhosh Thottingal
Work has started. See https://gitlab.com/smc/mlmorph/tree/master/spellcheck. API available at https://morph.smc.org.in/api/spellcheck?word=കെരളത്തിലെ Consider joining :)
Inflection and agglutination in Malayalam produce a very large number of complex word forms. They cannot be listed in a static wordlist. Hence wordlist based spellcheckers are not possible for Malayalam. See earlier approaches: https://thottingal.in/documents/MalayalamComputingChallenges.pdf and https://wiki.smc.org.in/Spellchecker
The mlmorph project addresses the Malayalam morphology. With the help of the analyser, it is possible to develop a functional spelling checker for Malayalam. Let us see how.
What is a spellchecker
Let us define a spellchecker formally. Given a word W in Malayalam, which is either spelled correctly or not, the primary task of the spellchecker, spellcheck(W), is to return a boolean answer. A True value indicates that W is spelled correctly; a False value means the word is not spelled correctly.
Spelled correctly means that the spellchecker could identify the word based on its set of known words, M. M could be a static list of words or a list that the system can generate dynamically. Traditionally, M was a static list (see https://wiki.smc.org.in/Spellchecker) and did not work well. In this proposed approach, M is defined by the mlmorph analyser. Any word which the mlmorph analyser can analyse is a known word to M. That means:
- Any w such that w ∈ M is a correctly spelled word.
- Any w such that w ∉ M is not a correctly spelled word.
The second task of the spellchecker is to provide a set of alternate words WS which are correct words for the given W. Let us denote them as ws1, ws2, ws3, ..., wsn. Note that any member of WS is a known word for M, that is, wsn ∈ M.
So that is the
As explained above, pass the word to the mlmorph analyser. If it accepts and analyses the word, return boolean True, otherwise False.
Python APIs are provided in mlmorph to analyse and get results. A single word can have multiple analyses. So any non-zero number of analysis results indicates a True output for spellcheck.
- spellcheck('കാട്ടരുവിയിലെ') => True
- spellcheck('മഴമേഖങ്ങൾ') => False
- spellcheck('നാലയിരം') => False
- spellcheck('പോകാതിരുനു') => False
- spellcheck('എവിടെയായിരുന്നു') => True
- spellcheck('ഓടിപോയി') => False
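The check described above can be sketched as a thin wrapper around any function that returns a list of analyses. This is only a sketch: `make_spellcheck`, `toy_analyse` and `KNOWN` are hypothetical names, and the toy analyser stands in for mlmorph so that the snippet runs on its own. With mlmorph installed, the idea would be to pass the analyser's own analyse method instead; treat that wiring as an assumption to verify against the mlmorph documentation.

```python
from typing import Callable, Iterable


def make_spellcheck(analyse: Callable[[str], Iterable]) -> Callable[[str], bool]:
    """Build spellcheck(W) from any function that returns the analyses of a word."""

    def spellcheck(word: str) -> bool:
        # Any non-zero number of analyses means the word is known, i.e. it is in M.
        return len(list(analyse(word))) > 0

    return spellcheck


# Stand-in analyser for illustration only: it "knows" exactly two words.
# Assumption: the real analyser would be plugged in here instead.
KNOWN = {"കാട്ടരുവിയിലെ", "എവിടെയായിരുന്നു"}


def toy_analyse(word: str) -> list:
    return [(word, 0)] if word in KNOWN else []


spellcheck = make_spellcheck(toy_analyse)
print(spellcheck("കാട്ടരുവിയിലെ"))  # a word the toy analyser knows
print(spellcheck("മഴമേഖങ്ങൾ"))      # a word the toy analyser does not know
```

Keeping the analyser as a parameter also makes the function easy to test without a full morphological model.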
Suggestions are meaningful only when spellcheck(W) is False. The first task is to find alternate words W1, W2, etc. that are candidates for WS1, WS2. Once we have a set of candidate words, pass those candidate words to spellcheck. Candidate words that pass spellcheck are suggestions.
In a typical spellchecker, these candidate words are generated from the wordlist M itself. The candidates are chosen based on a concept called edit distance, which rests on assumptions about spelling errors. In general, a spelling error can result from a missing letter, an extra letter, an alternate letter or the duplication of a letter. Edit distance tells how many such operations are required to get a candidate word from the original word w. All words with an edit distance less than a threshold are candidate words. The edit distance calculation is often based on https://en.wikipedia.org/wiki/Levenshtein_distance
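For reference, here is a minimal sketch of that classic approach: a standard dynamic-programming Levenshtein distance, counted in Unicode code points, plus a filter over a finite wordlist. The function names and the default threshold are illustrative assumptions, not part of mlmorph.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, counted in Unicode code points."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def wordlist_candidates(word: str, wordlist: Iterable[str],
                        threshold: int = 2) -> List[str]:
    """Words from a finite wordlist whose distance to word is below threshold."""
    return [w for w in wordlist if levenshtein(word, w) < threshold]


print(levenshtein("kitten", "sitting"))
print(wordlist_candidates("മൃതംഗം", ["മൃദംഗം", "ഗൂഢാലോചന"]))
```

Note that `wordlist_candidates` has to iterate over the whole wordlist, which is exactly what is not possible when M is not a finite list.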
But there are multiple issues with this approach:
- The insertion, deletion, duplication and substitution operations are applied blindly, without looking at the characteristics of the spelling mistakes people actually make. I believe that the nature of spelling mistakes varies per script or language. For example, in Malayalam, spelling mistakes are often due to phonetic confusability. മൃതംഗം, ഗൂഡാലോചന - in these words, the mistake is in using a phonetically similar letter. It is almost impossible that a person writes മൃദംഗം as മൃദംംഗം or ഗൂഢാലോചന as ഗൂഢാലോതന. It is also very rare that people misspell the first letter of a word. In English, due to its spelling system, weird-wierd and receive-recieve are possible mistakes, but I don't think interchanging letters is a pattern in Malayalam spelling mistakes.
- The candidate words are taken from a wordlist, but if that wordlist is not finite, you cannot do this.
For Malayalam, I propose the following method to get a candidate list to validate for suggestions.
Candidate generation strategies
Based on the nature of Malayalam spelling mistakes, generate a set of candidates from the word w using the following strategies. A strategy is a method of modifying w based on familiar spelling mistakes in Malayalam.
- Vowel sign elongation - Elongate the short vowels present in the word. For example, if w is കോഴിക്കൊട്, this strategy outputs കോഴിക്കോട്
- Vowel sign shortening - Shorten the long vowels present in the word. For example, if w is പൊരൂത്തം, this strategy outputs പൊരുത്തം
- Vowel to vowel sign - Replace any vowel inside a word with its vowel sign. So കലഇകാലം gives കലികാലം
- Normalize to atomic chillu - Replace any non-atomic chillu to atomic chillu
- ൌ -> ൗ - Replace ൌ sign with ൗ
- ൻറ -> ന്റ - Nta correction
- ററ -> റ്റ - tta correction
- Aspirated consonant to Unaspirated consonant - Example, ധ -> ദ, ഢ -> ഡ
- Unaspirated consonant to aspirated consonant - Example, ദ ->ധ, ഡ ->ഢ
- മൃദു -> ഖരം
- ഖരം -> മൃദു
- Delete one character at a time, except the first one. For example മുതതല -> മതതല, മുതല, മുതല, മുതത
- Insert a single space starting from the second position onwards. Don't insert a space before a chillu, virama or vowel sign. മകൾമര്യാദ -> മകൾ മര്യാദ, മകൾമ ര്യാദ, മകൾമര്യാ ദ
- Replace chillu with its consonant+virama form. So ർ -> ര്
- Replace ര്, ല്, ണ്, റ്, ന്, ള്, മ് with its chillu forms. ര് -> ർ, ള് -> ൾ
- Insert virama between two adjacent consonants. അദധ്വാനം -> അദ്ധ്വാനം
- Consonant to geminated consonant, if the consonant does not have an adjacent virama. പച്ചതത്ത -> പച്ചത്തത്ത
- Insert യ് before ക്കുക. Example വക്കുക - വയ്ക്കുക
- Replace ഇ and ി with എ, െ respectively.
- Replace ന്പ, ൻപ, ംപ, ംമ്പ with മ്പ
- Replace ്ര with ൃ - ഹ്രുദയം - ഹൃദയം
- Add missing ്യ after ്ര - സ്വാതന്ത്രം -> സ്വാതന്ത്ര്യം
- Swap ്യ with ൃ and reverse. ക്യ <-> കൃ
- Replace ന് with ം, മ്. Example: സന്ഘി -> സംഘി
- ഇ->എ as in ഇല-എല, വില-വെല, ചില-ചെല
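Several of the strategies above are plain character substitutions, so they can share one helper. The following is a minimal sketch, assuming each strategy is a function from a word to a list of candidates (a lightweight form of the Strategy pattern). All names here are hypothetical, and the tables cover only the examples given in the list, not every vowel sign or consonant pair.

```python
import unicodedata
from typing import Callable, Dict, List

# A strategy is any function that maps a word to a list of candidate words.
Strategy = Callable[[str], List[str]]

# Short vowel sign -> long vowel sign: ി->ീ, ു->ൂ, െ->േ, ൊ->ോ
ELONGATE = {"\u0d3f": "\u0d40", "\u0d41": "\u0d42",
            "\u0d46": "\u0d47", "\u0d4a": "\u0d4b"}
# Long vowel sign -> short vowel sign (the inverse table)
SHORTEN = {long: short for short, long in ELONGATE.items()}
# Unaspirated -> aspirated consonant: ദ->ധ, ഡ->ഢ (only the listed examples)
ASPIRATE = {"\u0d26": "\u0d27", "\u0d21": "\u0d22"}


def substitute_each(word: str, table: Dict[str, str]) -> List[str]:
    """Apply the table at one position at a time; one candidate per match."""
    # Compose vowel signs that may have been typed as two code points.
    word = unicodedata.normalize("NFC", word)
    return [word[:i] + table[ch] + word[i + 1:]
            for i, ch in enumerate(word) if ch in table]


def elongate_vowel_signs(word: str) -> List[str]:
    return substitute_each(word, ELONGATE)


def shorten_vowel_signs(word: str) -> List[str]:
    return substitute_each(word, SHORTEN)


def aspirate_consonants(word: str) -> List[str]:
    return substitute_each(word, ASPIRATE)


STRATEGIES: List[Strategy] = [
    elongate_vowel_signs, shorten_vowel_signs, aspirate_consonants,
]


def generate_candidates(word: str) -> List[str]:
    """Run every strategy and collect the distinct candidates, in order."""
    return list(dict.fromkeys(
        c for strategy in STRATEGIES for c in strategy(word)))


print(elongate_vowel_signs("കോഴിക്കൊട്"))
print(generate_candidates("ഗൂഡാലോചന"))
```

Substituting at one position at a time keeps the candidate set small; a word with mistakes in two places is left to a later pass.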
Note: Each strategy can give one or more candidates.
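The deletion and space-insertion strategies show why a single strategy can return several candidates: they produce one candidate per position. Here is a rough sketch with hypothetical function names. One assumption goes beyond the list above: the check uses Unicode mark categories, so it also refuses to put a space before anusvara and visarga, not only before chillu, virama and vowel signs.

```python
import unicodedata
from typing import List

# Atomic chillu letters: ൺ ൻ ർ ൽ ൾ ൿ
CHILLUS = {chr(cp) for cp in range(0x0D7A, 0x0D80)}


def is_dependent(ch: str) -> bool:
    """True for characters that should not follow an inserted space."""
    # Vowel signs and virama are Unicode marks (category Mn or Mc).
    return ch in CHILLUS or unicodedata.category(ch).startswith("M")


def delete_one(word: str) -> List[str]:
    """Delete one code point at a time, never the first one."""
    return [word[:i] + word[i + 1:] for i in range(1, len(word))]


def insert_space(word: str) -> List[str]:
    """Insert one space before each code point from the second one onwards."""
    return [word[:i] + " " + word[i:]
            for i in range(1, len(word)) if not is_dependent(word[i])]


print(delete_one("മുതതല"))
print(insert_space("മകൾമര്യാദ"))
```

Both functions work on code points, so a deletion can remove a vowel sign or virama on its own; duplicates in the output are left for the caller to drop.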
First pass: All candidates from all strategies are collected and each of them is validated with spellcheck. Candidates for which spellcheck returns True are considered suggestions.
If the first pass does not give any suggestions, the second pass is executed.
Second pass: For all candidates resulting from the first pass, execute the first pass again. Basically this means the word may have more than one type of spelling mistake, resulting from more than one strategy listed above.
An example of a word that gets a successful suggestion, ആലത്തൂർ, only from the second pass is ആലതൂര്
I don't see a reason to go beyond the second pass. If the second pass does not result in any suggestions, w is a misspelled word with no suggestions.
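The two passes can be sketched as a small loop over a "frontier" of words. This is an illustrative sketch, not the project's implementation: `suggest`, `to_chillu_r` and `geminate` are hypothetical names, the two toy strategies are simplified versions of items in the list above, and a one-word set stands in for the analyser.

```python
from typing import Callable, List

Strategy = Callable[[str], List[str]]
VIRAMA = "\u0d4d"  # ്


def suggest(word: str, spellcheck: Callable[[str], bool],
            strategies: List[Strategy], max_passes: int = 2) -> List[str]:
    """Return suggestions for a misspelled word, using at most max_passes passes."""
    if spellcheck(word):
        return []  # suggestions are only meaningful for misspelled words
    frontier = [word]
    for _ in range(max_passes):
        # One pass: apply every strategy to every word in the frontier.
        candidates = list(dict.fromkeys(
            c for w in frontier for strategy in strategies for c in strategy(w)))
        suggestions = [c for c in candidates if spellcheck(c)]
        if suggestions:
            return suggestions
        frontier = candidates  # feed the failed candidates into the next pass
    return []


def to_chillu_r(word: str) -> List[str]:
    """Toy strategy: replace ര് with the chillu ർ."""
    fixed = word.replace("\u0d30" + VIRAMA, "\u0d7c")
    return [fixed] if fixed != word else []


def geminate(word: str) -> List[str]:
    """Toy strategy: double each consonant that has no virama next to it."""
    out = []
    for i, ch in enumerate(word):
        is_consonant = "\u0d15" <= ch <= "\u0d39"  # ക .. ഹ
        near_virama = VIRAMA in word[max(i - 1, 0):i + 2]
        if is_consonant and not near_virama:
            out.append(word[:i] + ch + VIRAMA + ch + word[i + 1:])
    return out


KNOWN = {"\u0d06\u0d32\u0d24\u0d4d\u0d24\u0d42\u0d7c"}  # ആലത്തൂർ
misspelt = "\u0d06\u0d32\u0d24\u0d42\u0d30\u0d4d"       # ആലതൂര്
print(suggest(misspelt, lambda w: w in KNOWN, [to_chillu_r, geminate]))
```

With `max_passes=1` the same call returns no suggestions, because the word needs both the gemination and the chillu fix before it becomes a known word.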
- Python is the recommended programming language
- Use the Python API of mlmorph
- For strategies, use the Strategy pattern in Python. See https://sourcemaking.com/design_patterns/strategy/python/1
- Use test driven development.
- Do not develop any UI or editors. Just build the basic methods defined above.
- If you intend to work on this, first contact Santhosh Thottingal ([email protected]) to avoid possible duplication of efforts.