Performance difference compared to NeMo results

Find out why the results differ from the (greedy) results reported by NeMo.

The model conversion seems to be correct, so is the gap caused by differences in the dataset or in the post-processing steps?
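
One way to check the post-processing hypothesis is to score the same hypotheses twice, once on the raw text and once after normalization, and see how much of the gap disappears. The sketch below is only an illustration: it assumes the metric in question is word error rate, and the `normalize` rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) are hypothetical and not taken from NeMo. All function names are made up for this example.

```python
import re
import string
from typing import Callable, Iterable, Optional, Tuple


def normalize(text: str) -> str:
    """Hypothetical normalization (an assumption, not NeMo's actual rules):
    lowercase, drop ASCII punctuation (including apostrophes), collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def word_errors(ref: str, hyp: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    r, h = ref.split(), hyp.split()
    # prev[j] = edit distance between the reference words seen so far and h[:j]
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(
                prev[j] + 1,               # reference word missing in hypothesis
                cur[j - 1] + 1,            # extra word in hypothesis
                prev[j - 1] + (rw != hw),  # match or substitution
            ))
        prev = cur
    return prev[-1], len(r)


def corpus_wer(
    refs: Iterable[str],
    hyps: Iterable[str],
    norm: Optional[Callable[[str], str]] = None,
) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        if norm is not None:
            ref, hyp = norm(ref), norm(hyp)
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words if words else 0.0


if __name__ == "__main__":
    # Placeholder sentences, not real evaluation data.
    refs = ["Hello, world!"]
    hyps = ["hello world"]
    print("raw WER:       ", corpus_wer(refs, hyps))
    print("normalized WER:", corpus_wer(refs, hyps, norm=normalize))
```

If the two scores differ noticeably on the real transcripts, post-processing is a likely contributor. If they are close, the dataset (splits, filtering, reference transcripts) would be the next thing to compare.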