Lucene - Core / LUCENE-5110

DefaultICUTokenizerConfig should use the default ICU behavior for the Khmer script

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0
    • Fix Version/s: None
    • Component/s: modules/other
    • Labels: None
    • Lucene Fields: New

    Description

      Recent versions of ICU have their own implementation of tokenization for the Khmer script, so Lucene should no longer override ICU's behavior.

      I haven't tried the patch out, but it should look something like the following:

      $ diff DefaultICUTokenizerConfig.java.orig DefaultICUTokenizerConfig.java
      67,68d66
      < private static final BreakIterator thaiBreakIterator =
      < BreakIterator.getWordInstance(new ULocale("th_TH"));
      71,72d68
      < private static final BreakIterator khmerBreakIterator =
      < readBreakIterator("Khmer.brk");
      87d82
      < case UScript.THAI: return (BreakIterator)thaiBreakIterator.clone();
      89d83
      < case UScript.KHMER: return (BreakIterator)khmerBreakIterator.clone();

      The Khmer.* files should be removed as well. ICU already does script-specific tokenization these days, so the Thai break iterator should not be needed either since ICU 50.
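
For illustration, here is a minimal sketch of the shape the lookup could take once the script-specific cases are gone: one shared default prototype, cloned for each caller. It is not the actual Lucene code. The class name, the `SCRIPT_*` constants, and the `words` helper are hypothetical, and `java.text.BreakIterator` from the JDK stands in for `com.ibm.icu.text.BreakIterator` only so that the sketch runs without ICU4J on the classpath. The JDK class does not reproduce ICU's Khmer or Thai segmentation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class DefaultBreakIteratorSketch {
    // Hypothetical script codes, used only to show that the script no longer matters.
    static final int SCRIPT_THAI = 1;
    static final int SCRIPT_KHMER = 2;

    // One shared prototype. BreakIterator is stateful, so callers receive clones.
    // Assumption: with ICU4J this would be an ICU word instance for the root locale.
    private static final BreakIterator DEFAULT_PROTOTYPE =
        BreakIterator.getWordInstance(Locale.ROOT);

    // No per-script override: every script gets a clone of the default iterator.
    static BreakIterator getBreakIterator(int script) {
        return (BreakIterator) DEFAULT_PROTOTYPE.clone();
    }

    // Splits text at word boundaries and drops whitespace-only segments.
    static List<String> words(String text, int script) {
        BreakIterator it = getBreakIterator(script);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("hello world", SCRIPT_KHMER));
    }
}
```

The point of the design is that the script-to-iterator mapping lives in the library rather than in Lucene, so rule or dictionary updates in ICU are picked up without maintaining separate .brk files.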

          People

            Assignee: Unassigned
            Reporter: George Rhoten (grhoten)
            Votes: 0
            Watchers: 2