
A Perl script to remove HTML tags, foreign languages from the corpus -...

Ajith R requested to merge ajithramayyan/corpus:undefined into master

CleanCorpus.pl: A Perl script that removes HTML tags and foreign-language text from the corpus, which is expected to make word counts more reliable. It converts the old chillu notation to the newer atomic notation. It also removes newlines that are not preceded by a period, question mark, or exclamation mark, and adds a newline after each period, question mark, and exclamation mark, which is expected to make sentence counts more reliable.
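For reviewers who want to see the described steps in one place, here is a minimal sketch in Python. It is not the code of CleanCorpus.pl, and every function name in it is hypothetical. It assumes the corpus is Malayalam (chillu letters are specific to that script), that "old chillu notation" means consonant + virama + zero-width joiner, and that "foreign" means anything outside the Malayalam Unicode block apart from digits, whitespace, and basic punctuation. The script's actual rules may differ.

```python
import re

VIRAMA = "\u0D4D"
ZWJ = "\u200D"

# Old sequence (consonant + virama + ZWJ) -> atomic chillu (Unicode 5.1).
CHILLU_MAP = {
    "\u0D23" + VIRAMA + ZWJ: "\u0D7A",  # NNA -> chillu NN
    "\u0D28" + VIRAMA + ZWJ: "\u0D7B",  # NA  -> chillu N
    "\u0D30" + VIRAMA + ZWJ: "\u0D7C",  # RA  -> chillu RR
    "\u0D32" + VIRAMA + ZWJ: "\u0D7D",  # LA  -> chillu L
    "\u0D33" + VIRAMA + ZWJ: "\u0D7E",  # LLA -> chillu LL
    "\u0D15" + VIRAMA + ZWJ: "\u0D7F",  # KA  -> chillu K
}

def strip_html(text):
    # Naive tag removal; assumes tags contain no literal '>' in attributes.
    return re.sub(r"<[^>]+>", " ", text)

def convert_chillu(text):
    for old, new in CHILLU_MAP.items():
        text = text.replace(old, new)
    return text

def remove_foreign(text):
    # Assumption: keep Malayalam block, ZWNJ/ZWJ, whitespace, digits,
    # and a small set of punctuation; drop everything else.
    return re.sub(r"[^\u0D00-\u0D7F\u200C\u200D\s0-9.,?!;:()-]", "", text)

def normalize_sentences(text):
    # Join lines that do not end in a sentence terminator.
    text = re.sub(r"(?<![.?!])\n", " ", text)
    # Put a newline after each run of terminators.
    # Limitation: this also splits decimals and abbreviations.
    text = re.sub(r"([.?!]+)\s*", r"\1\n", text)
    text = re.sub(r"[ \t]+", " ", text)
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    return "\n".join(lines)

def clean_corpus(text):
    text = strip_html(text)
    text = convert_chillu(text)   # before filtering, while ZWJ is intact
    text = remove_foreign(text)
    return normalize_sentences(text)
```

Chillu conversion runs before the foreign-text filter so that the ZWJ sequences are rewritten while still intact. The sentence splitter is deliberately simple and would need extra rules for decimals and abbreviations.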

Edited by Ajith R
