[OAK-3574] Query engine: support p=lowercase('x') and other function-based indexes - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5.11, 1.6.0
Component/s: core, lucene, query
Labels:
None

Description

Currently, the query engine and indexes don't support function-based indexes for conditions of the form lowercase(property) = 'x'. This needs to be supported, specially for the Lucene property index.

Also important is lowercase(name()).

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

OAK-3574-lucene.patch
08/Sep/16 14:25
41 kB
Thomas Mueller
OAK-3574-lucene-2.patch
14/Sep/16 14:12
47 kB
Thomas Mueller

Issue Links

is related to

OAK-8166 Index definition with orderable property definitions with and without functions breaks index

Closed

OAK-6905 Query engine: support coalesce function as in-built method

Closed

OAK-9625 Support ordered index for first value of a multi-valued property, node name, and path

Closed

is required by

OAK-6905 Query engine: support coalesce function as in-built method

Closed

relates to

OAK-9388 QueryImply#canSortByIndex() explicitly returns false for function based indexes though it's actually supported.

Resolved

Activity

Ascending order - Click to sort in descending order

Thomas Mueller added a comment - 09/Jun/16 12:22 - edited

length(property) = 10 can be supported as well. Also, other comparison types, such as upper(property) > 'abc', length(property) < 10.

In theory, combined functions can be supported (upper(lower(property)), but the grammar doesn't support that, and such combinations rarely make sense. The condition length(name()) = 10 is not supported by the grammar.

Thomas Mueller added a comment - 09/Jun/16 12:22 - edited length(property) = 10 can be supported as well. Also, other comparison types, such as upper(property) > 'abc', length(property) < 10. In theory, combined functions can be supported (upper(lower(property)), but the grammar doesn't support that, and such combinations rarely make sense. The condition length(name()) = 10 is not supported by the grammar.

Thomas Mueller added a comment - 10/Jun/16 12:51

http://svn.apache.org/r1747703 (trunk)

Work in progress. The query engine now supports the relevant cases, and adds property restrictions as needed. What is missing is support on the Lucene index side. I don't think the property index needs to support this currently, even thought it could do that as well.

Thomas Mueller added a comment - 10/Jun/16 12:51 http://svn.apache.org/r1747703 (trunk) Work in progress. The query engine now supports the relevant cases, and adds property restrictions as needed. What is missing is support on the Lucene index side. I don't think the property index needs to support this currently, even thought it could do that as well.

Chetan Mehrotra added a comment - 10/Jun/16 13:12 - edited

thomasm It would be better if parsing of the encoded function semantics is done in PropertyRestriction itself

So instead of

+                if (name.startsWith(QueryConstants.FUNCTION_RESTRICTION_PREFIX)) {
+                    // TODO support function-based indexes
+                    continue;
+                }

One can say pr.isFunction() and then have pr.getArgument(index) to get the proper property name (which can be relative). Also for lower/upper was thinking that at Lucene level we should only support case insensitive index i.e. property values say stored always in lower case and comparison done after converting the value from query to lower case. QE can then enforce proper check as the resulting result set would be a super set

Chetan Mehrotra added a comment - 10/Jun/16 13:12 - edited thomasm It would be better if parsing of the encoded function semantics is done in PropertyRestriction itself So instead of + if (name.startsWith(QueryConstants.FUNCTION_RESTRICTION_PREFIX)) { + // TODO support function-based indexes + continue; + } One can say pr.isFunction() and then have pr.getArgument(index) to get the proper property name (which can be relative). Also for lower/upper was thinking that at Lucene level we should only support case insensitive index i.e. property values say stored always in lower case and comparison done after converting the value from query to lower case. QE can then enforce proper check as the resulting result set would be a super set

Thomas Mueller added a comment - 10/Jun/16 13:54

I understand that "case insensitive" queries are the most important use case, and that's the one we need to support first.

But function-based indexes are very flexible and can be used for other cases as well, for example indexing the length of a property (for example length of a binary), indexing formulas such as the area (x*y if you have properties x and y), and so on. So function-based indexes don't necessarily only have one property name, there can be multiple, so that "getArgument" doesn't always make sense. I know that currently, the query languages don't support complex expressions (not even length(name()), but that can easily be changed in the parser.

It would be best if keep some flexibility, and write the API so that future extensions are not awkward and break backward compatibility.

> for lower/upper was thinking that at Lucene level we should only support case insensitive index

I'm not sure what Lucene supports. Java supports String.compareToIgnoreCase, but I'm not sure if we can use that. For example assuming the condition is "upper(firstName) = 'john'", then there are no matches. It would be incorrect to use String.compareToIgnoreCase. Sure, we could log a warning that says the condition doesn't make much sense. But there can only be matches for "lower(firstName) = 'john'".

Thomas Mueller added a comment - 10/Jun/16 13:54 I understand that "case insensitive" queries are the most important use case, and that's the one we need to support first. But function-based indexes are very flexible and can be used for other cases as well, for example indexing the length of a property (for example length of a binary), indexing formulas such as the area (x*y if you have properties x and y), and so on. So function-based indexes don't necessarily only have one property name, there can be multiple, so that "getArgument" doesn't always make sense. I know that currently, the query languages don't support complex expressions (not even length(name()), but that can easily be changed in the parser. It would be best if keep some flexibility, and write the API so that future extensions are not awkward and break backward compatibility. > for lower/upper was thinking that at Lucene level we should only support case insensitive index I'm not sure what Lucene supports. Java supports String.compareToIgnoreCase, but I'm not sure if we can use that. For example assuming the condition is "upper(firstName) = 'john'", then there are no matches. It would be incorrect to use String.compareToIgnoreCase. Sure, we could log a warning that says the condition doesn't make much sense. But there can only be matches for "lower(firstName) = 'john'".

Thomas Mueller added a comment - 14/Jun/16 10:12

http://svn.apache.org/r1748364 (trunk; fixes broken integration tests)

Thomas Mueller added a comment - 14/Jun/16 10:12 http://svn.apache.org/r1748364 (trunk; fixes broken integration tests)

Thomas Mueller added a comment - 14/Jun/16 11:49

http://svn.apache.org/r1748390 (trunk; fixes a test case that depends on the Java version used)

Thomas Mueller added a comment - 14/Jun/16 11:49 http://svn.apache.org/r1748390 (trunk; fixes a test case that depends on the Java version used)

Thomas Mueller added a comment - 08/Sep/16 13:36

http://svn.apache.org/r1759820 (oak-core, mainly refactoring)
http://svn.apache.org/r1759821 (oak-lucene, don't fail tests if only function conditions are there)

Thomas Mueller added a comment - 08/Sep/16 13:36 http://svn.apache.org/r1759820 (oak-core, mainly refactoring) http://svn.apache.org/r1759821 (oak-lucene, don't fail tests if only function conditions are there)

Thomas Mueller added a comment - 08/Sep/16 14:25 - edited

~~OAK-3574~~-lucene.patch: patch for the Lucene index to support function based indexes.

Tests: there are new test for this feature, and changes to integration test. For existing integration tests, the relevant function-based indexes are created. If an integration test uses a function restriction that is not indexes, the test fails because the index runs in "test mode".

Index configuration: similar to indexing a property. For the condition (XPath) "fn:upper-case(@lastName) = 'x'", create a new node under the relevant Lucene index (compatVersion 2), as follows. To use it for "order by fn:upper-case(@lastName)", use "ordered":

/oak:index/assetType
  + indexRules
    + nt:base
      + properties
        + upperLastName
          - function = "fn:upper-case(@lastName)"
          - ordered = true

The syntax of the "function" property is similar to the relevant condition in the query. Both XPath and SQL are supported (even thought I would probably use XPath in practise). Examples for property "function":

fn:upper-case(@data)
fn:lower-case(test/@data)
fn:lower-case(fn:name())
fn:lower-case(fn:local-name())
fn:string-length(test/@data)
upper([data])
lower([test/data])
lower(name())
lower(localname())
length([test/data])
length(name())

Indexing multi-valued properties are supported. Relative properties are supported (except for ".." and "."). Range conditions are supported ('>', '>=', '<=', '<').

chetanm could you please review the patch?

Thomas Mueller added a comment - 08/Sep/16 14:25 - edited OAK-3574 -lucene.patch: patch for the Lucene index to support function based indexes. Tests: there are new test for this feature, and changes to integration test. For existing integration tests, the relevant function-based indexes are created. If an integration test uses a function restriction that is not indexes, the test fails because the index runs in "test mode". Index configuration: similar to indexing a property. For the condition (XPath) "fn:upper-case(@lastName) = 'x'", create a new node under the relevant Lucene index (compatVersion 2), as follows. To use it for "order by fn:upper-case(@lastName)", use "ordered": /oak:index/assetType + indexRules + nt:base + properties + upperLastName - function = "fn:upper-case(@lastName)" - ordered = true The syntax of the "function" property is similar to the relevant condition in the query. Both XPath and SQL are supported (even thought I would probably use XPath in practise). Examples for property "function": fn:upper-case(@data) fn:lower-case(test/@data) fn:lower-case(fn:name()) fn:lower-case(fn:local-name()) fn:string-length(test/@data) upper([data]) lower([test/data]) lower(name()) lower(localname()) length([test/data]) length(name()) Indexing multi-valued properties are supported. Relative properties are supported (except for ".." and "."). Range conditions are supported ('>', '>=', '<=', '<'). chetanm could you please review the patch?

Chetan Mehrotra added a comment - 13/Sep/16 09:32

This looks much more powerful now!

String name = functionIndex.function;
String f = FunctionIndexParser.convertToPolishNotation(name);

Instead of computing the polish notation in IndexPlanner this can be computed once and stored in PropertyDefinition itself.

Currently the function evaluation logic in LuceneIndexEditor is handling relative property itself. This would work fine when the node gets indexed for first time or if some non function property got changed. But it would not work if the relative property which is encoded in function gets changed as editor would not be aware of that.

Instead of that it would be better to use a separate field in property definition to capture function and then have property name determined from that. Then

In IndexDefinition#collectPropConfigs have the loop evaluate pd.relative part as that would then register this relative property.
And then you can implement the function evaluation logic in LuceneIndexEditor#indexProperty

Chetan Mehrotra added a comment - 13/Sep/16 09:32 This looks much more powerful now! String name = functionIndex.function; String f = FunctionIndexParser.convertToPolishNotation(name); Instead of computing the polish notation in IndexPlanner this can be computed once and stored in PropertyDefinition itself. Currently the function evaluation logic in LuceneIndexEditor is handling relative property itself. This would work fine when the node gets indexed for first time or if some non function property got changed. But it would not work if the relative property which is encoded in function gets changed as editor would not be aware of that. Instead of that it would be better to use a separate field in property definition to capture function and then have property name determined from that. Then In IndexDefinition#collectPropConfigs have the loop evaluate pd.relative part as that would then register this relative property. And then you can implement the function evaluation logic in LuceneIndexEditor#indexProperty

Thomas Mueller added a comment - 14/Sep/16 14:01

> this can be computed once

You are right! I will fix that.

> not work if the relative property which is encoded in function gets changed

Good catch! I will add a test case and fix this. I will add the relative property / properties in collectPropConfigs to the propAggregate list. That way, a function can potentially have multiple parameters in the future (for example, a function that calculates the duration based on start and end date).

Thomas Mueller added a comment - 14/Sep/16 14:01 > this can be computed once You are right! I will fix that. > not work if the relative property which is encoded in function gets changed Good catch! I will add a test case and fix this. I will add the relative property / properties in collectPropConfigs to the propAggregate list. That way, a function can potentially have multiple parameters in the future (for example, a function that calculates the duration based on start and end date).

Thomas Mueller added a comment - 14/Sep/16 14:12

New patch with fixed issues. The code processing functions should now be a bit more encapsulated and easier to re-use (class FunctionIndexProcessor). Processing now uses a simple stack machine, so it should be easier to extend.

chetanm could you review again please?

Thomas Mueller added a comment - 14/Sep/16 14:12 New patch with fixed issues. The code processing functions should now be a bit more encapsulated and easier to re-use (class FunctionIndexProcessor). Processing now uses a simple stack machine, so it should be easier to extend. chetanm could you review again please?

Thomas Mueller added a comment - 16/Sep/16 12:03

http://svn.apache.org/r1761025 (trunk) - committing the current patch, please revert if there is a problem

Thomas Mueller added a comment - 16/Sep/16 12:03 http://svn.apache.org/r1761025 (trunk) - committing the current patch, please revert if there is a problem

Davide Giannella added a comment - 04/Oct/16 08:48

Bulk close for 1.5.11

Davide Giannella added a comment - 04/Oct/16 08:48 Bulk close for 1.5.11

People

Assignee:: Thomas Mueller

Reporter:: Thomas Mueller

Votes:: 13 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 30/Oct/15 15:27

Updated:: 17/Apr/23 14:05

Resolved:: 16/Sep/16 12:03

Jackrabbit Oak

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates