[LUCENE-5110] DefaultICUTokenizerConfig should use the default ICU behavior for the Khmer script - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.0
Fix Version/s: None
Component/s: modules/other
Labels: None
Lucene Fields: New

Description

Recent versions of ICU have their own implementation of tokenization for the Khmer script. Lucene should no longer override ICU's behavior.

I haven't tested this, but the patch should look something like the following:

$ diff DefaultICUTokenizerConfig.java.orig DefaultICUTokenizerConfig.java
67,68d66
< private static final BreakIterator thaiBreakIterator =
< BreakIterator.getWordInstance(new ULocale("th_TH"));
71,72d68
< private static final BreakIterator khmerBreakIterator =
< readBreakIterator("Khmer.brk");
87d82
< case UScript.THAI: return (BreakIterator)thaiBreakIterator.clone();
89d83
< case UScript.KHMER: return (BreakIterator)khmerBreakIterator.clone();

The Khmer.* files should also be removed. ICU now does script-specific tokenization itself, so the Thai override should not be needed either as of ICU 50.
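
A quick way to check what the default ICU behavior looks like, without going through Lucene, is to run ICU4J's stock word BreakIterator over some sample text. This is only a rough sketch and not part of the proposed patch: the class name DefaultWordBreakSketch, the helper segments(), and the sample strings are made up for illustration, and it assumes a recent ICU4J jar is on the classpath. I have not listed the expected Khmer output, since the exact boundaries depend on the dictionary shipped with the ICU version in use.

```java
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Lucene or ICU.
public class DefaultWordBreakSketch {

    // Splits text at the boundaries reported by ICU's default (root locale)
    // word BreakIterator. Whitespace and punctuation runs are kept as segments.
    static List<String> segments(String text) {
        BreakIterator bi = BreakIterator.getWordInstance(ULocale.ROOT);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample strings chosen for illustration only.
        String khmer = "ភាសាខ្មែរ";
        String thai = "ภาษาไทย";
        System.out.println("Khmer: " + segments(khmer));
        System.out.println("Thai:  " + segments(thai));
    }
}
```

If the segments printed for the Khmer and Thai samples look reasonable with the ICU version Lucene bundles, that would support dropping the custom iterators as described above.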

People

Assignee: Unassigned

Reporter: George Rhoten

Votes: 0

Watchers: 2

Dates

Created: 14/Jul/13 23:58

Updated: 28/Aug/22 13:49