OAK-1614: Oak Analyzer can't tokenize Chinese phrases

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.19
    • Fix Version/s: 0.20
    • Component/s: None
    • Labels: None

    Description

      It looks like the WhitespaceTokenizer cannot properly split Chinese phrases, for example '美女衬衫' (roughly 'women's blouse'), because they contain no whitespace between words.
      I could not find a reference to this issue other than LUCENE-5096.

      The fix is to switch to the ClassicTokenizer, which seems better equipped for this kind of task; the sketch below illustrates the difference.
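
      As a rough illustration (not the actual Oak patch), the snippet below compares the two tokenizers on the phrase from above. It assumes a recent lucene-analyzers-common API where the Reader is supplied via setReader(); the Lucene version bundled with Oak at the time passed the Reader to the tokenizer constructor instead.

      {code:java}
      import java.io.IOException;
      import java.io.StringReader;

      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.core.WhitespaceTokenizer;
      import org.apache.lucene.analysis.standard.ClassicTokenizer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

      public class TokenizerComparison {

          // Drains a tokenizer and prints every token it emits for the given text.
          static void printTokens(Tokenizer tokenizer, String text) throws IOException {
              tokenizer.setReader(new StringReader(text));
              CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
              tokenizer.reset();
              while (tokenizer.incrementToken()) {
                  System.out.println("  token: " + term);
              }
              tokenizer.end();
              tokenizer.close();
          }

          public static void main(String[] args) throws IOException {
              String phrase = "美女衬衫";

              // No whitespace in the phrase, so the whole string comes back as one token.
              System.out.println("WhitespaceTokenizer:");
              printTokens(new WhitespaceTokenizer(), phrase);

              // ClassicTokenizer's grammar emits each CJK character as its own token,
              // so the phrase becomes four tokens: 美, 女, 衬, 衫.
              System.out.println("ClassicTokenizer:");
              printTokens(new ClassicTokenizer(), phrase);
          }
      }
      {code}

      Per-character tokens are crude compared to a dedicated CJK analyzer, but they at least let term and phrase queries over Chinese text match, which the single whitespace-delimited token never could.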


          People

            stillalex Alex Deparvu
            stillalex Alex Deparvu
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue
