The mlmorph project addresses Malayalam morphology. With the help of its analyser, it is possible to develop a functional spelling checker for Malayalam. Let us see how.
What is a spellchecker
Let us define a spellchecker formally. Given a word W in Malayalam, which is either spelled correctly or not, the primary task of the spellchecker, spellcheck(W), is to return a boolean answer. True indicates that W is spelled correctly; False means that it is not.
Spelled correctly means that the spellchecker could identify the word based on its known word set M. M could be a static list of words or a list that the system can generate dynamically. Traditionally, M was a static list (see https://wiki.smc.org.in/Spellchecker), which did not work well. In the proposed approach, M is the mlmorph analyser: any word that the analyser can analyse is a known word in M. That means:
Any w such that w ∈ M is a correctly spelled word.
Similarly, any w such that w ∉ M is not a correctly spelled word.
The second task of the spellchecker is to provide a set of alternative words WS that are correct words for the given W. Let us denote them ws1, ws2, ws3, ..., wsn.
Note that every wsn in WS is a known word for M, i.e. wsn ∈ M.
That is the suggestions(W) task.
As explained above, pass the word to the mlmorph analyser. If it accepts and analyses the word, return True; otherwise return False.
Python APIs are provided in mlmorph to analyse a word and get the results. A single word can have multiple analyses, so any non-zero number of analysis results gives True for spellcheck(W).
spellcheck('കാട്ടരുവിയിലെ') => True
spellcheck('മഴമേഖങ്ങൾ') => False
spellcheck('നാലയിരം') => False
spellcheck('പോകാതിരുനു') => False
spellcheck('എവിടെയായിരുന്നു') => True
spellcheck('ഓടിപോയി') => False
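A minimal sketch of this check, assuming only that the analyser is a function returning a list of analyses for a word. The names make_spellcheck, toy_analyse and KNOWN are hypothetical and exist only for illustration. With mlmorph installed, the analyse method of its analyser object (assumed here to be Analyser().analyse) would be passed in place of the stand-in.

```python
from typing import Callable, Dict, List, Sequence


def make_spellcheck(analyse: Callable[[str], Sequence]) -> Callable[[str], bool]:
    """Build spellcheck(W) from any function that returns analyses for a word."""

    def spellcheck(word: str) -> bool:
        # Any non-zero number of analyses means the word is known to M.
        return len(analyse(word)) > 0

    return spellcheck


# Stand-in analyser so that the sketch runs without mlmorph installed.
# The analysis strings are placeholders, not real mlmorph output.
KNOWN: Dict[str, List[str]] = {
    "കാട്ടരുവിയിലെ": ["<placeholder analysis>"],
    "എവിടെയായിരുന്നു": ["<placeholder analysis>"],
}


def toy_analyse(word: str) -> List[str]:
    return KNOWN.get(word, [])


spellcheck = make_spellcheck(toy_analyse)
print(spellcheck("കാട്ടരുവിയിലെ"))
print(spellcheck("മഴമേഖങ്ങൾ"))
```

Keeping the analyser as a parameter means the same spellcheck function works whether M is a static list or a dynamic analyser.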
The suggestions task is meaningful only if spellcheck(W) is False. The first step is to find alternative words W1, W2, etc. that are candidates for ws1, ws2, and so on. Once we have a set of candidate words, we pass them to spellcheck. A candidate word that passes spellcheck is a suggestion.
In a typical spellchecker, these candidate words are generated from the wordlist M itself. The candidates are chosen based on a concept called edit distance, which rests on assumptions about spelling errors. In general, a spelling error can result from a missing letter, an extra letter, an alternate letter or the duplication of a letter. Edit distance tells how many such operations are required to get a candidate word from the original word w. All words with an edit distance below a threshold are candidate words. The edit distance calculation is often based on https://en.wikipedia.org/wiki/Levenshtein_distance
But there are multiple issues with this approach:
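For reference, the classic Levenshtein distance can be sketched as follows. This is the textbook dynamic-programming version over Unicode code points, not code taken from any particular spellchecker, and the function name is only illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute ca with cb, or keep it
                )
            )
        prev = curr
    return prev[-1]


print(levenshtein("recieve", "receive"))
print(levenshtein("മൃതംഗം", "മൃദംഗം"))
```

Note that plain Levenshtein distance has no swap operation, so two interchanged letters count as two substitutions. It also treats every substitution as equally likely, whether or not the letters sound alike.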
The insertion, deletion, duplication and substitution operations are applied blindly, without looking at the characteristics of the spelling mistakes people actually make. I believe that the nature of spelling mistakes varies per script or language. For example, in Malayalam, spelling mistakes are often due to phonetic confusability. മൃതംഗം, ഗൂഡാലോചന - in these words, the mistake is in using a phonetically similar letter. It is almost impossible that a person writes മൃദംഗം as മൃദംംഗം or ഗൂഢാലോചന as ഗൂഢാലോതന. It is also very rare that people misspell the first letter of a word. In English, due to its spelling system, weird-wierd and receive-recieve are possible mistakes, but I don't think interchanging letters is a pattern in Malayalam spelling mistakes.
The candidate words are taken from the wordlist, but if that wordlist is not finite, as is the case when M is an analyser rather than a static list, you cannot do this.
For Malayalam, I propose the following method to get a list of candidates to validate as suggestions.
Candidate generation strategies
Based on the nature of Malayalam spelling mistakes, generate a set of candidates from the word w using the following strategies. A strategy is a method of modifying w based on familiar spelling mistakes in Malayalam.
Vowel sign elongation - Elongate the short vowels present in the word. For example, if w is കോഴിക്കൊട്, this strategy outputs കോഴിക്കോട്
Vowel sign shortening - Shorten the long vowels present in the word. For example, if w is പൊരൂത്തം, this strategy outputs പൊരുത്തം
Vowel to vowel sign - Replace any vowel inside a word with its vowel sign. So കലഇകാലം gives കലികാലം
Normalize to atomic chillu - Replace any non-atomic chillu with its atomic chillu
ൌ -> ൗ - Replace ൌ sign with ൗ
ൻറ -> ന്റ - Nta correction
ററ -> റ്റ - tta correction
Aspirated consonant to Unaspirated consonant - Example, ധ -> ദ, ഢ -> ഡ
Unaspirated consonant to aspirated consonant - Example, ദ ->ധ, ഡ ->ഢ
മൃദു -> ഖരം
ഖരം -> മൃദു
Delete one character at a time, except the first one. For example മുതതല -> മതതല, മുതല, മുതല, മുതത
Insert a single space starting from the second position onwards. Don't insert a space before a chillu, virama or vowel sign. മകൾമര്യാദ -> മകൾ മര്യാദ, മകൾമ ര്യാദ, മകൾമര്യാ ദ
Replace chillu with its consonant+virama form. So ർ -> ര്
Replace ര്, ല്, ണ്, റ്, ന്, ള്, മ് with their chillu forms. ര് -> ർ, ള് -> ൾ
Insert virama between two adjacent consonants. അദധ്വാനം -> അദ്ധ്വാനം
Consonant to geminated consonant, if the consonant does not have an adjacent virama. പച്ചതത്ത -> പച്ചത്തത്ത
Insert യ് before ക്കുക. Example വക്കുക - വയ്ക്കുക
Replace ഇ and ി with എ, െ respectively.
Replace ന്പ, ൻപ, ംപ, ംമ്പ with മ്പ
Replace ്ര with ൃ - ഹ്രുദയം - ഹൃദയം
Add missing ്യ after ്ര - സ്വാതന്ത്രം -> സ്വാതന്ത്ര്യം
Swap ്യ with ൃ and reverse. ക്യ <-> കൃ
Replace ന് with ം or മ്. Example: സന്ഘി -> സംഘി
ഇ -> എ, as in ഇല-എല, വില-വെല, ചില-ചെല
Note: Each strategy can give one or more candidates.
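A few of the strategies above can be sketched as small functions that each return a set of candidates. This is an illustrative subset under my own assumptions, not the actual mlmorph implementation: all function and table names are hypothetical, each substitution changes one character at a time, and the space-insertion rule follows only the restriction stated in the list.

```python
from typing import Callable, Dict, List, Set

VIRAMA = "\u0d4d"  # ്
# Dependent vowel signs U+0D3E..U+0D4C plus the au length mark U+0D57.
VOWEL_SIGNS = {chr(c) for c in range(0x0D3E, 0x0D4D)} | {"\u0d57"}
CHILLUS = {chr(c) for c in range(0x0D7A, 0x0D80)}  # atomic chillus ൺ..ൿ

LONGER: Dict[str, str] = {"ി": "ീ", "ു": "ൂ", "െ": "േ", "ൊ": "ോ"}
SHORTER = {long: short for short, long in LONGER.items()}
TO_UNASPIRATED = {"ഖ": "ക", "ഘ": "ഗ", "ഛ": "ച", "ഝ": "ജ", "ഠ": "ട",
                  "ഢ": "ഡ", "ഥ": "ത", "ധ": "ദ", "ഫ": "പ", "ഭ": "ബ"}
TO_ASPIRATED = {plain: asp for asp, plain in TO_UNASPIRATED.items()}
KHARAM_TO_MRIDU = {"ക": "ഗ", "ച": "ജ", "ട": "ഡ", "ത": "ദ", "പ": "ബ"}
MRIDU_TO_KHARAM = {m: k for k, m in KHARAM_TO_MRIDU.items()}


def substitute_one(word: str, table: Dict[str, str]) -> Set[str]:
    """One candidate per matching character, replacing only that character."""
    return {word[:i] + table[ch] + word[i + 1:]
            for i, ch in enumerate(word) if ch in table}


def delete_one(word: str) -> Set[str]:
    """Delete one character at a time, never the first one."""
    return {word[:i] + word[i + 1:] for i in range(1, len(word))}


def insert_space(word: str) -> Set[str]:
    """Insert one space from the second position onwards, but never
    before a chillu, virama or vowel sign."""
    blocked = VOWEL_SIGNS | CHILLUS | {VIRAMA}
    return {word[:i] + " " + word[i:]
            for i in range(1, len(word)) if word[i] not in blocked}


STRATEGIES: List[Callable[[str], Set[str]]] = [
    lambda w: substitute_one(w, LONGER),
    lambda w: substitute_one(w, SHORTER),
    lambda w: substitute_one(w, TO_UNASPIRATED),
    lambda w: substitute_one(w, TO_ASPIRATED),
    lambda w: substitute_one(w, KHARAM_TO_MRIDU),
    lambda w: substitute_one(w, MRIDU_TO_KHARAM),
    delete_one,
    insert_space,
]


def generate_candidates(word: str) -> Set[str]:
    """Union of the candidates from every strategy, excluding the word itself."""
    candidates: Set[str] = set()
    for strategy in STRATEGIES:
        candidates |= strategy(word)
    candidates.discard(word)
    return candidates


print(sorted(substitute_one("മൃതംഗം", KHARAM_TO_MRIDU)))
print(sorted(delete_one("മുതതല")))
```

Returning sets removes duplicate candidates, such as the two identical results of deleting either ത from മുതതല. The candidates are not suggestions yet: each one still has to pass spellcheck.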
First pass: All candidates from all strategies are collected, and each of them is validated with spellcheck. A candidate for which spellcheck returns True is considered a suggestion wsn.
If the first pass does not give any suggestions, the second pass is executed.
Second pass: For every candidate produced by the first pass, run the first pass again. This covers words that have more than one type of spelling mistake, each corresponding to a different strategy listed above.
An example of a word that gets a successful suggestion from the second pass is ആലതൂര്, which gives ആലത്തൂർ.
I don't see a reason to go beyond the second pass. If the second pass does not produce any suggestions, w is a misspelled word with no suggestions.
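The two passes can be sketched as a short loop. This is a minimal illustration under assumptions: suggestions takes the spellcheck function and the candidate generator as parameters, and toy_spellcheck and toy_generate are hypothetical stand-ins that know one word and implement only two strategies (chillu normalization of ര് and gemination of ത).

```python
from typing import Callable, Iterable, List, Set

VIRAMA = "\u0d4d"  # ്


def suggestions(word: str,
                spellcheck: Callable[[str], bool],
                generate: Callable[[str], Iterable[str]],
                max_passes: int = 2) -> List[str]:
    """Return the accepted candidates from the first pass that yields any."""
    if spellcheck(word):
        return []  # suggestions are only meaningful for misspelled words
    frontier: Set[str] = {word}
    seen: Set[str] = {word}
    for _ in range(max_passes):
        candidates: Set[str] = set()
        for w in frontier:
            candidates |= set(generate(w))
        candidates -= seen  # do not revisit earlier candidates
        accepted = sorted(c for c in candidates if spellcheck(c))
        if accepted:
            return accepted
        seen |= candidates
        frontier = candidates  # the next pass starts from these candidates
    return []


def toy_spellcheck(word: str) -> bool:
    return word in {"ആലത്തൂർ"}


def toy_generate(word: str) -> Set[str]:
    out: Set[str] = set()
    # Strategy: replace ര + virama with the chillu ർ.
    if "ര" + VIRAMA in word:
        out.add(word.replace("ര" + VIRAMA, "ർ"))
    # Strategy: geminate ത when it has no adjacent virama.
    for i, ch in enumerate(word):
        before = word[i - 1] if i > 0 else ""
        after = word[i + 1] if i + 1 < len(word) else ""
        if ch == "ത" and VIRAMA not in (before, after):
            out.add(word[:i] + "ത" + VIRAMA + "ത" + word[i + 1:])
    return out


print(suggestions("ആലതൂര്", toy_spellcheck, toy_generate))
print(suggestions("ആലതൂര്", toy_spellcheck, toy_generate, max_passes=1))
```

With only one pass, neither ആലതൂർ nor ആലത്തൂര് is a known word, so the list is empty. The second pass applies the other strategy to each of them and both paths reach ആലത്തൂർ. Keeping max_passes small matters because the number of candidates grows quickly with each pass.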