SRXSegmenter does not handle parts covered by previous match

Created by: Anonymous

Original issue 426 created by @ysavourel on 2014-12-05T16:07:00.000Z:

See https://groups.yahoo.com/neo/groups/okapitools/conversations/topics/4478 for details.

The second rule really means ".", i.e. always break.

Then, yes, I think there is a problem in the code: we don’t check the rule on the parts of the text included in the previous match.

Changing the code in SRXSegmenter.java from this:

    m = rule.pattern.matcher(codedText);
    while ( m.find() ) {
        int n = m.start()+m.group(1).length();
        if ( n > codedText.length() ) continue;

To this:

    m = rule.pattern.matcher(codedText);
    int start = 0;
    while ( m.find(start) ) {
        int n = m.start()+m.group(1).length();
        start++;
        if ( n > codedText.length() ) continue;

Should resolve this.

But there is a side effect in the Aligner step tests.
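
For reference, here is a minimal standalone sketch of the underlying behavior, independent of the Okapi code. The class name OverlapDemo, the pattern "(a)a" and both helper methods are made up for illustration; only java.util.regex is real. It shows that Matcher.find() resumes after the end of the previous match, so a position inside that match is never tested, while restarting with find(int) just past the previous match start does test it. Note that this sketch advances to m.start()+1 rather than using start++ as in the change above, which is an assumption on my part to avoid reporting the same match several times.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlapDemo {

    // Hypothetical stand-in for a break rule: group 1 is the text before the break.
    static final Pattern RULE = Pattern.compile("(a)a");

    // Plain find(): each search resumes after the END of the previous match,
    // so positions covered by that match are skipped.
    static List<Integer> positionsWithFind(String text) {
        List<Integer> out = new ArrayList<>();
        Matcher m = RULE.matcher(text);
        while (m.find()) {
            out.add(m.start() + m.group(1).length());
        }
        return out;
    }

    // find(int): each search restarts one character after the START of the
    // previous match, so overlapping matches are also reported.
    static List<Integer> positionsWithRestart(String text) {
        List<Integer> out = new ArrayList<>();
        Matcher m = RULE.matcher(text);
        int start = 0;
        while (start <= text.length() && m.find(start)) {
            out.add(m.start() + m.group(1).length());
            start = m.start() + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(positionsWithFind("aaaa"));    // prints [1, 3]
        System.out.println(positionsWithRestart("aaaa")); // prints [1, 2, 3]
    }
}
```

The extra position (2) in the second result is the kind of break that the current loop misses. It also means more break candidates overall, which may be related to the changed results in the Aligner step tests, though I have not verified that.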
