Performance difference to NeMo results

Find out why the measured performance differs from the (greedy decoding) results reported by NeMo.

The conversion of the models seems to be correct, so the difference may come from the dataset or from the post-processing steps.
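To illustrate how much post-processing alone can shift the numbers, here is a minimal sketch that computes the word error rate (WER) for the same transcript pair with and without text normalization. The `normalize` rules (lowercasing, umlaut replacement, dropping everything except `a-z` and spaces) are assumptions chosen for illustration; they are not taken from NeMo or from this project's pipeline.

```python
import re


def normalize(text: str) -> str:
    """Hypothetical normalization; the real pipelines may use other rules."""
    text = text.lower()
    # Assumed mapping of German special characters to ASCII.
    for src, dst in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        text = text.replace(src, dst)
    # Replace anything that is not a lowercase letter or space with a space.
    text = re.sub(r"[^a-z ]+", " ", text)
    # Collapse repeated whitespace.
    return " ".join(text.split())


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of reference words."""
    n_words = len(reference.split())
    return word_errors(reference, hypothesis) / n_words if n_words else 0.0


if __name__ == "__main__":
    reference = "Die Tür ist offen."
    hypothesis = "die tuer ist offen"
    print(wer(reference, hypothesis))                        # 0.75
    print(wer(normalize(reference), normalize(hypothesis)))  # 0.0
```

In this made-up example the raw comparison counts three of four words as errors, while the normalized comparison counts none. Running both evaluation pipelines on identical transcripts with identical normalization would show whether post-processing explains the gap.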

Other projects, such as https://github.com/domcross/german-stt-evaluation or https://arxiv.org/pdf/2204.05617.pdf (Table 4), report different results again.

Edited by DANBER