Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
4.0
-
None
-
None
-
New
Description
Recent versions of ICU have their own implementation for the tokenization of the Khmer script. Lucene should not be overriding ICU's behavior any more.
I haven't tried the patch out, but the patch should look something like the following:
$ diff DefaultICUTokenizerConfig.java.orig DefaultICUTokenizerConfig.java
67,68d66
< private static final BreakIterator thaiBreakIterator =
< BreakIterator.getWordInstance(new ULocale("th_TH"));
71,72d68
< private static final BreakIterator khmerBreakIterator =
< readBreakIterator("Khmer.brk");
87d82
< case UScript.THAI: return (BreakIterator)thaiBreakIterator.clone();
89d83
< case UScript.KHMER: return (BreakIterator)khmerBreakIterator.clone();
and the Khmer.* files should be removed. ICU already does script specific tokenization these days. So the Thai one should not be needed either since ICU 50.