  1. Lucene - Core
  2. LUCENE-8125

emoji sequence support in ICUTokenizer

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: trunk, 7.4
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      The UAX #29 word break rules already handle emoji sequences correctly; we just need to assign them a token type.

      This is better than users trying to do this with custom rules (e.g. LUCENE-7916), because emoji are script-independent (their code points are in the Common/Inherited scripts), while custom rules are configured per script.
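
      To illustrate the "script-independent" point, here is a minimal sketch that uses only the JDK (not Lucene or ICU). It checks that every code point in a few emoji sequences is reported as Common or Inherited by `Character.UnicodeScript`. The class name `EmojiScriptCheck` and its helper methods are hypothetical and are not part of the patch; results also depend on the Unicode version bundled with the JDK in use.

```java
import java.lang.Character.UnicodeScript;

/** Illustration only (hypothetical helper, not part of LUCENE-8125). */
final class EmojiScriptCheck {

    /** Builds a string from a list of Unicode code points. */
    static String fromCodePoints(int... codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    /** Returns true if every code point in s has script COMMON or INHERITED. */
    static boolean isScriptIndependent(String s) {
        return s.codePoints().allMatch(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            return script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
        });
    }

    public static void main(String[] args) {
        // MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, GIRL (a ZWJ sequence)
        String family = fromCodePoints(0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467);
        // HEAVY BLACK HEART followed by VARIATION SELECTOR-16
        String heart = fromCodePoints(0x2764, 0xFE0F);

        System.out.println(isScriptIndependent(family));
        System.out.println(isScriptIndependent(heart));
        // Latin letters belong to a specific script, so this check fails for them.
        System.out.println(isScriptIndependent("Lucene"));
    }
}
```

      Because none of these code points belongs to a specific script such as Latin or Han, a rule attached to a single script would not be a natural place to handle them, which is the motivation given above for handling emoji in the tokenizer itself.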

      Attachments

        1. LUCENE-8125.patch
          17 kB
          Robert Muir
        2. LUCENE-8125.patch
          19 kB
          Robert Muir
        3. LUCENE-8125.patch
          20 kB
          Robert Muir
        4. LUCENE-8125.patch
          19 kB
          Robert Muir
        5. LUCENE-8125.patch
          27 kB
          Robert Muir

            People

              Assignee: Unassigned
              Reporter: Robert Muir (rcmuir)
              Votes: 0
              Watchers: 5
