Description
This issue is from https://github.com/elastic/elasticsearch/issues/30739
The smartcn analyzer can't handle surrogate characters (supplementary characters encoded as a UTF-16 surrogate pair). Example:
Analyzer ca = new SmartChineseAnalyzer();
String sentence = "\uD862\uDE0F"; // 𨨏 a surrogate char
TokenStream tokenStream = ca.tokenStream("", sentence);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
    String term = charTermAttribute.toString();
    System.out.println(term);
}
The code snippet above outputs:
? ?
I have created a patch to try to fix this; please help review. Since smartcn only supports GBK characters, the patch just handles a surrogate pair as a single character.
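To illustrate the general idea of treating a surrogate pair as one character, here is a minimal sketch using only the Java standard library. It is not the attached patch and not smartcn's internal code; the class name SurrogateSplitDemo and both helper methods are hypothetical. It only shows why walking a string one UTF-16 code unit at a time would presumably yield two unprintable halves (consistent with the "? ?" output), while walking it by code point keeps the character whole.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Lucene or of the attached patch.
public class SurrogateSplitDemo {

    // Naive split: one piece per UTF-16 code unit.
    // A supplementary character is cut into two lone surrogates.
    static List<String> splitByChar(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(String.valueOf(s.charAt(i)));
        }
        return out;
    }

    // Code-point-aware split: a surrogate pair stays together as one piece.
    static List<String> splitByCodePoint(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp); // advances by 2 for a surrogate pair
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "\uD862\uDE0F"; // same input as in the report
        System.out.println(splitByChar(sentence).size());      // prints 2
        System.out.println(splitByCodePoint(sentence).size()); // prints 1
    }
}
```

With the input from the report, the naive split produces two pieces, each a lone surrogate that cannot be printed on its own, whereas the code-point-aware split produces a single piece equal to the original string.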