Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
The BreakIterator implementation used by the UnifiedHighlighter can have a significant effect on highlighting performance. The default implementations come from the JDK, so we have no control over them; they may well be optimized, but they have a complicated job to do. I propose computing the break locations at index time in a Solr UpdateRequestProcessor and placing them in a pre-analyzed common field, perhaps named _highlighter_breaks_, which needs indexed=true plus offsets. In this field, the term is the actual field name, the position is meaningless, and the offset pair is the span delimited by the break iterator (typically a sentence). Lucene can store this data efficiently. The UnifiedHighlighter already has a flexible BreakIterator producer, but it is not told which document is current, so changes would be needed there (a separate LUCENE issue).
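To make the proposed field contents concrete, here is a minimal sketch of the index-time step: it runs the JDK sentence BreakIterator once over a field value and records each passage as a (field name, start offset, end offset) triple. The class and method names (HighlighterBreaks, Span, sentenceSpans) are hypothetical and used only for illustration. The Solr UpdateRequestProcessor wiring and the encoding into a pre-analyzed field are deliberately left out, since the issue does not specify them.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or Solr.
final class HighlighterBreaks {

    /** One precomputed passage: the "term" is the source field name. */
    static final class Span {
        final String fieldName;
        final int startOffset; // inclusive
        final int endOffset;   // exclusive

        Span(String fieldName, int startOffset, int endOffset) {
            this.fieldName = fieldName;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Computes sentence spans once, so they need not be recomputed at query time. */
    static List<Span> sentenceSpans(String fieldName, String text) {
        BreakIterator breaks = BreakIterator.getSentenceInstance(Locale.ROOT);
        breaks.setText(text);

        List<Span> spans = new ArrayList<>();
        int start = breaks.first();
        // Each pair of consecutive boundaries delimits one passage.
        for (int end = breaks.next(); end != BreakIterator.DONE; end = breaks.next()) {
            spans.add(new Span(fieldName, start, end));
            start = end;
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Hello world. Second sentence here.";
        for (Span s : sentenceSpans("body", text)) {
            System.out.println(s.fieldName + " [" + s.startOffset + ", " + s.endOffset + ")");
        }
    }
}
```

In the proposal, each such span would become one token in _highlighter_breaks_, with the field name as the term and the two offsets as the token's offset pair. A highlighter-side BreakIterator could then read the offsets back instead of re-analyzing the text.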