SRXSegmenter does not handle parts covered by previous match
Created by: Anonymous
Original issue 426 created by @ysavourel on 2014-12-05T16:07:00.000Z:
See https://groups.yahoo.com/neo/groups/okapitools/conversations/topics/4478 for details.
The second rule really means ".", i.e. always break.
Then, yes, I think there is a problem in the code: we don’t check the rule on the parts of the text included in the previous match.
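To illustrate the behaviour described above, here is a minimal standalone sketch. It is not the Okapi code: the class name `OverlapDemo`, the helper `plainScan`, the pattern and the sample text are all made up for the example. It only assumes that a rule is compiled as one pattern whose group 1 is the before-break part, as in the snippets in this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlapDemo {

    // Collects candidate break positions using plain find().
    // Each find() resumes after the END of the previous match, so a
    // candidate that starts inside the previous match is never tested.
    static List<Integer> plainScan(Pattern p, String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            positions.add(m.start() + m.group(1).length());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Hypothetical rule: group 1 = a period, group 2 = any one character.
        Pattern p = Pattern.compile("(\\.)(.)");
        // The first match covers both periods, so the second period is
        // never tried as the start of a match.
        System.out.println(plainScan(p, "a..b")); // prints [2]
    }
}
```

With this input, position 3 (after the second period) would also be a candidate, but it is skipped because the second period was consumed by the first match.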
Changing the code in SRXSegmenter.java from this:
    m = rule.pattern.matcher(codedText);
    while ( m.find() ) {
        int n = m.start()+m.group(1).length();
        if ( n > codedText.length() ) continue;
To this:
    m = rule.pattern.matcher(codedText);
    int start = 0;
    while ( m.find(start) ) {
        int n = m.start()+m.group(1).length();
        start++;
        if ( n > codedText.length() ) continue;
Should resolve this.
But there is a side effect in the Aligner step tests.
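For reference, a standalone sketch comparing two ways of restarting the search, using a made-up pattern and text (the class `OverlapScan` and both helpers are hypothetical, not Okapi code). Stepping `start` by one per iteration can report the same match more than once, while restarting at `m.start() + 1` reports each match start once. The `start <= text.length()` guard is there because `Matcher.find(int)` throws `IndexOutOfBoundsException` when the index is past the end of the input. Whether either variant avoids the Aligner side effect has not been checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlapScan {

    // Same stepping as the proposed change: start grows by one per iteration.
    static List<Integer> stepByOne(Pattern p, String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = p.matcher(text);
        int start = 0;
        while (start <= text.length() && m.find(start)) {
            positions.add(m.start() + m.group(1).length());
            start++;
        }
        return positions;
    }

    // Alternative: restart one character after the start of the last match.
    static List<Integer> stepPastMatch(Pattern p, String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = p.matcher(text);
        int start = 0;
        while (start <= text.length() && m.find(start)) {
            positions.add(m.start() + m.group(1).length());
            start = m.start() + 1;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Hypothetical rule: group 1 = a period, group 2 = any one character.
        Pattern p = Pattern.compile("(\\.)(.)");
        System.out.println(stepByOne(p, "a..b"));     // prints [2, 2, 3]
        System.out.println(stepPastMatch(p, "a..b")); // prints [2, 3]
    }
}
```

Repeated positions may be harmless if the caller stores them keyed by position, but that depends on the surrounding code, which is not shown in this issue.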