Multi-language support for spamcheck
The predominant language of issues is English, but some issues are created in other languages. Multiple languages cause problems for both model training and inference. Spamcheck should be able to support an arbitrary number of languages.
Potential Ideas:
- Leverage the Google Translate API to convert issues to English when we detect a different language. This would require minimal code changes in the spamcheck service, but it would add latency to every spam verdict for a non-English issue, since each one needs a translation call first.
- Leverage the Google Translate API to translate the model training data and train models for the most common languages. This would give lower-latency spam verdicts but would be more complicated to implement. Model accuracy might also be affected by errors introduced in the translation phase.
- Use pre-trained models from Hugging Face to perform translation. This would likely be more cost-effective for translating training data but would add many gigabytes to the spamcheck image if we went this route for inference.
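
To make the first idea concrete, here is a minimal sketch of a translate-then-classify flow. It is not the spamcheck implementation: `spam_verdict`, `Verdict`, and the three injected callables are hypothetical names, and the language detector, translator, and English-only model are passed in as plain functions so that the sketch does not assume any particular translation API or model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Hypothetical result type, not part of spamcheck."""
    spam_score: float      # score from the English-only model
    source_language: str   # language code reported by the detector
    translated: bool       # True if a translation call was made


def spam_verdict(
    text: str,
    detect_language: Callable[[str], str],
    translate_to_english: Callable[[str, str], str],
    score_english: Callable[[str], float],
) -> Verdict:
    """Score an issue, translating it to English first when needed.

    All three callables are assumptions of this sketch:
      detect_language(text) -> language code such as "en" or "de"
      translate_to_english(text, source_language) -> English text
      score_english(text) -> spam score from the existing English model
    """
    language = detect_language(text)
    needs_translation = language != "en"
    if needs_translation:
        # This is the step that adds latency for non-English issues.
        text = translate_to_english(text, language)
    return Verdict(score_english(text), language, needs_translation)
```

English issues skip the translation call entirely, so only non-English issues pay the extra latency. Recording `translated` on the verdict would also make it possible to measure how often that path is taken.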
Edited by Ian Anderson