XLIFF filter should handle invalid XML characters better
It is an unfortunate fact of life that invalid numeric character references (such as &#x03; or &#x1F;, which point at control characters that XML 1.0 does not allow) sometimes show up in XLIFF files, particularly (in my experience) SDLXLIFF. Nobody knows where they come from, but they break many XML parsers, including the one we use.
A better behavior would be to strip these unparsable characters as we encounter them. This would need to be done at the I/O level, before the content is handed to the XML parser.
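A minimal sketch of that pre-parse stripping step, to make the idea concrete. The class and method names (InvalidXmlEntityStripper, strip, isValidXml10Char) are hypothetical and not part of Okapi. The sketch works on a String for brevity, whereas a real fix would probably wrap the input Reader, and it only handles numeric references, not raw control characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an existing Okapi class.
final class InvalidXmlEntityStripper {

    // Matches decimal (&#31;) and hexadecimal (&#x1F;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    // Character ranges permitted by the XML 1.0 "Char" production.
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the input with every numeric reference to a non-XML character removed.
    static String strip(String xml) {
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuilder out = new StringBuilder(xml.length());
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(xml, last, m.start());
            int cp;
            try {
                cp = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
            } catch (NumberFormatException e) {
                cp = -1; // value too large for an int, so it cannot be a valid character
            }
            if (isValidXml10Char(cp)) {
                out.append(m.group()); // keep valid references as written
            }
            last = m.end();
        }
        out.append(xml, last, xml.length());
        return out.toString();
    }
}
```

With this sketch, strip("a&#x03;b") yields "ab", while valid references such as &#x41; pass through untouched. One known limitation: inside a CDATA section the same character sequence is literal text rather than a reference, and this simple approach would remove it there too.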
Sample attached. You can see the failure by running:
tikal.sh -fc okf_xliff -x invalid_xml_entity.xlf
We currently crash with this stack:
Illegal character entity: expansion character (code 0x3
at [row,col,system-id]: [7,33,"file:/Users/chase/Downloads/invalid_xml_entity.xlf"]
at net.sf.okapi.filters.xliff.its.ITSStandoffManager.parseXLIFF(ITSStandoffManager.java:112)
at net.sf.okapi.filters.xliff.XLIFFITSFilterExtension.parseInDocumentITSStandoff(XLIFFITSFilterExtension.java:79)
at net.sf.okapi.filters.xliff.XLIFFFilter.open(XLIFFFilter.java:396)
at net.sf.okapi.filters.xliff.XLIFFFilter.open(XLIFFFilter.java:316)
at net.sf.okapi.filters.xliff.XLIFFFilter.open(XLIFFFilter.java:309)