[LUCENE-5205] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Later
Affects Version/s: None
Fix Version/s: None
Component/s: core/queryparser
Labels:
- patch

Description

This parser extends QueryParserBase and includes functionality from:

Classic QueryParser: most of its syntax.
SurroundQueryParser: recursive parsing for "near" and "not" clauses.
ComplexPhraseQueryParser: can handle "near" queries that include multiterms (wildcard, fuzzy, regex, prefix).
AnalyzingQueryParser: has an option to analyze multiterms.

At a high level, a first-pass BooleanQuery/field parser runs, and then a span query parser handles all terminal nodes and phrases.

Same as classic syntax:

term: test
fuzzy: roam~0.8, roam~2
wildcard: te?t, test*, t*st
regex: /[mb]oat/
phrase: "jakarta apache"
phrase with slop: "jakarta apache"~3
default "or" clause: jakarta apache
grouping "or" clause: (jakarta apache)
boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
multiple fields: title:lucene author:hatcher

Main additions in SpanQueryParser syntax vs. classic syntax:

Can require "in order" for phrases with slop with the ~> operator: "jakarta apache"~>3
Can specify "not near": "fever bieber"!~3,10 ::
find "fever" but not if "bieber" appears within 3 words before or 10 words after it.
Fully recursive phrasal queries with [ and ]; as in: [[jakarta apache]~3 lucene]~>4 ::
find "jakarta" within 3 words of "apache", and that hit has to be within four words before "lucene"
Can also use [] for single level phrasal queries instead of " as in: [jakarta apache]
Can use "or grouping" clauses in phrasal queries: "apache (lucene solr)"~3 :: find "apache" and then either "lucene" or "solr" within three words.
Can use multiterms in phrasal queries: "jakarta~1 ap*che"~2
Did I mention full recursion: [[jakarta~1 ap*che]~2 (solr~ /l[ou]+[cs][en]+/)]~10 :: Find something like "jakarta" within two words of "ap*che" and that hit has to be within ten words of something like "solr" or that "lucene" regex.
Can require at least x number of hits at the boolean level: apache AND (lucene solr tika)~2
Can use negative only query: -jakarta :: Find all docs that don't contain "jakarta"
Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of potential performance issues!).
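
The "not near" operator above is the least obvious of these additions, so here is a minimal sketch of its described semantics over a plain token list. This is only an illustration of the prose description ("fever" but not if "bieber" appears within 3 words before or 10 words after it); it is not the parser's code, and the function name `not_near` and the inclusive boundary handling are assumptions made for the example.

```python
from typing import List


def not_near(tokens: List[str], include: str, exclude: str,
             pre: int, post: int) -> List[int]:
    """Return 0-based positions of `include` that have no `exclude` token
    within `pre` positions before or `post` positions after them.

    Hypothetical helper: boundaries are treated as inclusive, which is an
    assumption and may differ from Lucene's span arithmetic.
    """
    exclude_positions = [i for i, tok in enumerate(tokens) if tok == exclude]
    hits = []
    for i, tok in enumerate(tokens):
        if tok != include:
            continue
        # A hit is blocked if any excluded token falls in [i - pre, i + post].
        blocked = any(i - pre <= j <= i + post for j in exclude_positions)
        if not blocked:
            hits.append(i)
    return hits


# Roughly what "fever bieber"!~3,10 asks for:
far_before = "bieber has a bad fever".split()   # "bieber" is 4 positions before
close_after = "fever then bieber".split()       # "bieber" is 2 positions after
print(not_near(far_before, "fever", "bieber", pre=3, post=10))
print(not_near(close_after, "fever", "bieber", pre=3, post=10))
```

In the first sentence "bieber" sits outside the 3-token window before "fever", so the hit survives; in the second it falls inside the 10-token window after "fever", so the hit is dropped.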

Trivial additions:

Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance =1, prefix =2)
Can specify Optimal String Alignment (OSA) vs. Levenshtein for distance <=2: jakarta~1 (OSA) vs. jakarta~>1 (Levenshtein)
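
To make the difference between the two distance measures and the prefix option concrete, here is a small self-contained sketch. It uses the textbook dynamic-programming definitions of Levenshtein and OSA distance, not the automaton-based matching Lucene uses internally, and the helper `fuzzy_match` with its parameters is hypothetical, introduced only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def osa(a: str, b: str) -> int:
    """Optimal String Alignment: Levenshtein plus adjacent transpositions,
    each counted as a single edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def fuzzy_match(term: str, candidate: str, max_edits: int,
                prefix_len: int = 0, transpositions: bool = True) -> bool:
    """Hypothetical helper: the first `prefix_len` characters must match
    exactly, and the chosen distance must not exceed `max_edits`."""
    if term[:prefix_len] != candidate[:prefix_len]:
        return False
    distance = osa if transpositions else levenshtein
    return distance(term, candidate) <= max_edits


# "jakatra" swaps two adjacent letters of "jakarta".
print(osa("jakarta", "jakatra"), levenshtein("jakarta", "jakatra"))
```

Under these definitions a swapped letter pair costs one edit with OSA and two with Levenshtein, so a candidate like "jakatra" would match at distance 1 only under OSA. With a prefix length of 2, a candidate such as "makarta" is rejected outright because its first two characters differ.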

This parser can be very useful for concordance tasks (see also LUCENE-5317 and LUCENE-5318) and for analytical search.

Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.

Most of the documentation is in the javadoc for SpanQueryParser.

Any and all feedback is welcome. Thank you.

Until this is added to the Lucene project, I've added a standalone lucene-addons repo (with jars compiled for the latest stable build of Lucene) on GitHub.

Attachments

LUCENE-5205_improve_stop_word_handling.patch (19 kB), 11/Mar/14 01:57, Tim Allison
LUCENE-5205-cleanup-tests.patch (122 kB), 28/Feb/14 20:17, Tim Allison
LUCENE-5205_dateTestReInitPkgPrvt.patch (10 kB), 21/Feb/14 15:05, Tim Allison
LUCENE-5205-date-pkg-prvt.patch (4 kB), 21/Feb/14 12:03, Tim Allison
LUCENE-5205_smallTestMods.patch (3 kB), 20/Feb/14 16:29, Tim Allison
LUCENE-5205.patch.gz (53 kB), 18/Feb/14 18:49, Tim Allison
LUCENE-5205.patch.gz (52 kB), 31/Jan/14 19:19, Tim Allison
LUCENE_5205.patch (260 kB), 25/Nov/13 20:20, Tim Allison
patch.txt (114 kB), 31/Oct/13 16:03, Tim Allison
SpanQueryParser_v1.patch.gz (20 kB), 12/Sep/13 17:52, Tim Allison

Issue Links

is depended upon by

SOLR-5410 Solr wrapper for the SpanQueryParser in LUCENE-5205 (Open)
SOLR-5412 TermVariants from fuzzy and/or span search (Open)
LUCENE-5758 Extend SpanQueryParser with positional joins (Resolved)

relates to

LUCENE-1823 QueryParser with new features for Lucene 3 (Open)
LUCENE-1486 Wildcards, ORs etc inside Phrase queries (Closed)

People

Assignee: Unassigned
Reporter: Tim Allison
Votes: 11
Watchers: 23

Dates

Created: 12/Sep/13 17:52
Updated: 28/Aug/22 13:52
Resolved: 07/Dec/15 17:38