[LUCENE-8036] ShingleFilter should have an option to skip filler tokens (e.g. stop words) - ASF JIRA

Details

Type: Improvement
Status: Patch Available
Priority: Trivial
Resolution: Unresolved
Affects Version/s: 7.1
Fix Version/s: None
Component/s: modules/analysis
Labels:

Lucene Fields: New

Description

ShingleFilterFactory should have an option to exclude filler tokens from the shingle size, so that they do not count toward it.
For instance (adapted from https://stackoverflow.com/questions/33193144/solr-stemming-stop-words-and-shingles-not-giving-expected-outputs), consider the text "A brown fox quickly jumps over the lazy dog". After stop words are removed (and terms are stemmed, as in the linked question, hence "jump"), running the ShingleFilter with a shingle size of 3 gives the following result:

1. _ brown fox
2. brown fox quickly
3. fox quickly jump
4. quickly jump _
5. jump _ _
6. _ _ lazy
7. _ lazy dog

The filler token "_" clearly occupies one token slot in every shingle where it appears.
I think the returned shingles should instead be:
1. brown fox quickly
2. fox quickly jump
3. quickly jump lazy
4. jump lazy dog

To maintain backward compatibility, I suggest adding an option called "skipFillerTokens" that enables this behavior. Note that this is different from setting fillerToken="", since the empty string still occupies one token slot in the shingle.
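The difference between the two listings can be sketched with a small standalone simulation. This is not Lucene's ShingleFilter code and not the attached patch: the class `ShingleSketch`, the `Token` holder and the `shingles` helper are hypothetical names used only for illustration. It assumes that each removed stop word shows up as a position increment greater than 1 on the next surviving token, and that unigrams and shorter shingles are not emitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {

    /** A term plus its position increment (1 means directly after the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Builds fixed-size shingles. Each position gap left by a removed token
     * becomes one filler term, unless skipFillerTokens is true.
     */
    static List<String> shingles(List<Token> tokens, int size, String filler,
                                 boolean skipFillerTokens) {
        List<String> stream = new ArrayList<>();
        for (Token token : tokens) {
            if (!skipFillerTokens) {
                // An increment of n means n - 1 positions were removed before this token.
                for (int gap = 1; gap < token.positionIncrement; gap++) {
                    stream.add(filler);
                }
            }
            stream.add(token.term);
        }
        // Slide a window of `size` terms over the (possibly filler-padded) stream.
        List<String> result = new ArrayList<>();
        for (int start = 0; start + size <= stream.size(); start++) {
            result.add(String.join(" ", stream.subList(start, start + size)));
        }
        return result;
    }

    /** "A brown fox quickly jumps over the lazy dog" after stop-word removal and stemming. */
    static List<Token> example() {
        return Arrays.asList(
            new Token("brown", 2),   // "A" was removed before this token
            new Token("fox", 1),
            new Token("quickly", 1),
            new Token("jump", 1),
            new Token("lazy", 3),    // "over" and "the" were removed before this token
            new Token("dog", 1));
    }

    public static void main(String[] args) {
        System.out.println(shingles(example(), 3, "_", false)); // current behavior
        System.out.println(shingles(example(), 3, "_", true));  // proposed skipFillerTokens behavior
    }
}
```

With `skipFillerTokens` set to false the sketch yields the seven shingles listed first; with it set to true it yields the four proposed ones. Passing an empty string as the filler while leaving `skipFillerTokens` false still yields seven shingles, which is why an empty filler token is not equivalent to the proposed option.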

I've attached a patch for the ShingleFilter class (getNextToken() method), ShingleFilterFactory, and ShingleFilterTest classes.

Attachments

SOLR-11604.patch, 04/Nov/17 19:28, 7 kB, Edans Sandes

Issue Links

is cloned by: SOLR-11605 ShingleFilter should have an option to skip filler tokens (e.g. stop words) (Resolved)

is related to: SOLR-6468 Regression: StopFilterFactory doesn't work properly without deprecated enablePositionIncrements="false" (Open)

People

Assignee: Unassigned

Reporter: Edans Sandes

Votes: 0

Watchers: 4

Dates

Created: 04/Nov/17 14:16

Updated: 28/Aug/22 15:21

Time Tracking: Not Specified